# Linear regression

In [1]:
# you will need to have scikit-learn version >= 0.20
from sklearn.datasets import fetch_openml

bodyfat = fetch_openml('bodyfat', version=1)

In [2]:
print(bodyfat.DESCR[:613])

**Author**: Roger W. Johnson  
**Source**: [UCI (not available anymore)](https://archive.ics.uci.edu/ml/index.php), [TunedIT](http://tunedit.org/repo/UCI/numeric/bodyfat.arff)  
**Please cite**: None. 

Short Summary:
Lists estimates of the percentage of body fat determined by underwater
weighing and various body circumference measurements for 252 men.

Classroom use of this data set:
This data set can be used to illustrate multiple regression techniques.
Accurate measurement of body fat is inconvenient/costly and it is
desirable to have easy methods of estimating body fat that are not
inconvenient/costly.


In [3]:
import numpy as np
import pandas as pd

fat_df = pd.DataFrame(bodyfat.data, columns=bodyfat.feature_names)
fat_df.head()

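|   | Density | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0708 | 23.0 | 154.25 | 67.75 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 1 | 1.0853 | 22.0 | 173.25 | 72.25 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 2 | 1.0414 | 22.0 | 154.0 | 66.25 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 3 | 1.0751 | 26.0 | 184.75 | 72.25 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
| 4 | 1.034 | 24.0 | 184.25 | 71.25 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |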


You can find a further description of the data here: https://www.openml.org/d/560

The predicted (dependent) variable is `Density`; the other attributes can be used as predictors.

## Exercise 1

Explore the data set: are there any correlated attributes? Which attributes correlate with the target variable `Density`? Visualize the important dependencies in the data.

In [4]:
# TODO
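
One possible starting point is a correlation matrix. The sketch below is a minimal illustration on synthetic stand-in data, not on the real bodyfat values: `demo_df` and its generated columns are hypothetical, and in the notebook `fat_df` would take its place.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for fat_df (hypothetical values, NOT the real bodyfat data).
rng = np.random.default_rng(0)
n = 200
abdomen = rng.normal(92.0, 10.0, n)
demo_df = pd.DataFrame({
    "Abdomen": abdomen,
    "Chest": 0.8 * abdomen + rng.normal(27.0, 4.0, n),  # built to correlate with Abdomen
    "Age": rng.normal(45.0, 12.0, n),                    # built to be unrelated
})
demo_df["Density"] = 1.15 - 0.001 * abdomen + rng.normal(0.0, 0.005, n)

# Pairwise Pearson correlations between all columns.
corr = demo_df.corr()

# Correlation of every predictor with the target, strongest (by absolute value) first.
target_corr = corr["Density"].drop("Density")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)
print(target_corr)
```

Large off-diagonal entries of `corr` point to correlated predictors, while `target_corr` ranks the candidates for a simple regression. A heatmap of `corr` or `pandas.plotting.scatter_matrix` would be natural ways to visualize the same information.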

## Exercise 2

Split the data into train and test sets.

In [5]:
# TODO
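
A hedged sketch of a hold-out split with `train_test_split` follows. The arrays `X` and `y` are random placeholders (an assumption for illustration); in the notebook they would be the predictor columns of `fat_df` and the `Density` column. The 20 % test size and the seed are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays with the same number of rows as the bodyfat data.
rng = np.random.default_rng(0)
X = rng.normal(size=(252, 3))
y = rng.normal(size=252)

# Hold out 20 % of the rows for testing; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the split identical across runs, so models trained in later exercises are compared on the same test rows.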

## Exercise 3

1. Train multiple simple linear regression models (i.e., each using only one attribute) with both the `statsmodels` and `scikit-learn` libraries.
2. Examine the trained models using the output of `statsmodels`. Which model better explains the variance in the data ($R^2$)?
3. Visualize the models using a combination of a scatter plot and a line plot.
4. Compare the trained models on the test set using the MSE or RMSE metric.
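
For reference, the quantities named above are usually defined as follows. The notation is an assumption of this note: $y_i$ is an observed value, $\hat{y}_i$ the model's prediction, $\bar{y}$ the mean of the observed values, and $n$ the number of samples.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE is expressed in the same units as the target, which makes it easier to interpret than MSE. $R^2$ compares the model's squared error with that of always predicting $\bar{y}$.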

In [6]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression

In [7]:
# TODO
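
The sketch below shows one way to compare two single-attribute models with `scikit-learn`. It runs on synthetic stand-in data: `x_strong` and `x_weak` are hypothetical predictors invented for the illustration, and the real exercise would use columns of `fat_df` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): y depends on x_strong but not on x_weak.
rng = np.random.default_rng(0)
n = 252
x_strong = rng.normal(92.0, 10.0, n)
x_weak = rng.normal(45.0, 12.0, n)
y = 1.15 - 0.001 * x_strong + rng.normal(0.0, 0.005, n)

# Split row indices once so both models see the same train and test rows.
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.2, random_state=42)

results = {}
for name, x in {"x_strong": x_strong, "x_weak": x_weak}.items():
    X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
    model = LinearRegression().fit(X[idx_train], y[idx_train])
    pred = model.predict(X[idx_test])
    results[name] = {
        "r2_train": model.score(X[idx_train], y[idx_train]),  # R^2 on the training rows
        "rmse_test": np.sqrt(mean_squared_error(y[idx_test], pred)),
    }
    print(name, results[name])
```

With `statsmodels`, a call of the form `ols("Density ~ Abdomen", data=train_df).fit().summary()` prints $R^2$ together with coefficient estimates and p-values; `train_df` is a hypothetical name for a training DataFrame.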

## Exercise 4

Train and compare multiple linear regression models: either manually select a subset of attributes or build a model with all attributes (do not forget to normalize the attributes first). Which attributes are good predictors? Is there a statistically significant relationship between the predictors and the predicted variable? How can the coefficients of the model be interpreted?

In [8]:
# TODO
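
As a minimal sketch of the normalization step, the example below chains `StandardScaler` and `LinearRegression` in a pipeline. The data are synthetic and hypothetical (two informative predictors on different scales plus one noise predictor); the real exercise would use the bodyfat attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(0)
n = 252
X = np.column_stack([
    rng.normal(92.0, 10.0, n),  # informative, large scale
    rng.normal(18.0, 1.0, n),   # informative, small scale
    rng.normal(45.0, 12.0, n),  # pure noise
])
y = 1.15 - 0.001 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0.0, 0.002, n)

# Standardize each column to zero mean and unit variance, then fit ordinary least squares.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

# After standardization, each coefficient is the expected change in y for a
# one-standard-deviation increase in that predictor, holding the others fixed.
coefs = pipe.named_steps["linearregression"].coef_
print(coefs)
```

Because all predictors share one scale after standardization, the magnitudes of `coefs` can be compared directly. Fitting the pipeline on the training rows only keeps test-set statistics out of the scaler. Statistical significance is not reported by `scikit-learn`; the `statsmodels` summary is the place to look for p-values.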

# Logistic regression

In [9]:
from sklearn import datasets

wine = datasets.load_wine()

In [10]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
        - Alcohol
        - Malic acid
        - Ash
        - Alcalinity of ash
        - Magnesium
        - Total phenols
        - Flavanoids
        - Nonflavanoid phenols
        - Proanthocyanins
        - Color intensity
        - Hue
        - OD280/OD315 of diluted wines
        - Proline

    - class:
            - class_0
            - class_1
            - class_2

    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [11]:
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


In [12]:
target = wine.target
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

## Exercise 5

Your task is to classify wines into one of three classes using logistic regression.

1. Follow the same steps as in Exercises 1-4, using both the `scikit-learn` and `statsmodels` libraries. Which metrics are used to evaluate a classifier?
2. Logistic regression is, in its basic form, a binary classifier. Which strategy does the `scikit-learn` library use so that its logistic regression can predict more than two classes? For more information, see: https://scikit-learn.org/stable/modules/multiclass.html

In [13]:
# TODO
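
A minimal `scikit-learn` sketch of the classification workflow is given below, using the wine data bundled with the library. The split ratio, the seed, and `max_iter=1000` are arbitrary assumptions. Which multiclass strategy `LogisticRegression` applies by default depends on the installed version and solver, so the sketch does not assume one; the documentation linked above is the place to check.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Stratify so that all three classes appear in both splits in similar proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling helps the solver converge; max_iter is raised as a precaution.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)    # fraction of correctly classified wines
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
proba = clf.predict_proba(X_test)       # one probability per class; each row sums to 1

print(acc)
print(cm)
```

Accuracy gives a single summary number, while the confusion matrix shows which classes are mistaken for each other. Per-class precision, recall, and F1 are further common choices.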