# Multiple logistic regression

You are working for a new company that manufactures multivitamin products. The company wants to enter the market with a new product. It has collected a dataset of the product types currently available on the market, along with the concentrations of various elements in each product. You have been asked to predict (that is, classify) the product type from these attributes.

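The base dataset is given in vitamin_product.xlsx. It contains the following columns:

- Na: Concentration of sodium in the product.
- K: Concentration of potassium in the product.
- Ca: Concentration of calcium in the product.
- Fe: Concentration of iron in the product.
- Product: The type of product (the classification target).

Note that the code below loads the extended file vitamin_product_additional.xlsx, whose output also shows two further columns:

- Al: Concentration of aluminium in the product.
- Ba: Concentration of barium in the product.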

Here's a template that assigns several tasks for you and also provides a roadmap to guide you. Please note, you are free to follow your own coding style and name the variables the way you like.

## 1. Load relevant libraries

In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## 2. Load the data

In [45]:
raw_data = pd.read_excel('vitamin_product_additional.xlsx')
raw_data.head()

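| | Na | Al | K | Ca | Ba | Fe | Product |
|---|---|---|---|---|---|---|---|
| 0 | 13.64 | 1.1 | 0.06 | 8.75 | 0.0 | 0.0 | vit_a |
| 1 | 13.89 | 1.36 | 0.48 | 7.83 | 0.0 | 0.0 | vit_a |
| 2 | 13.53 | 1.54 | 0.39 | 7.78 | 0.0 | 0.0 | vit_a |
| 3 | 13.21 | 1.29 | 0.57 | 8.22 | 0.0 | 0.0 | vit_a |
| 4 | 13.27 | 1.24 | 0.55 | 8.07 | 0.0 | 0.0 | vit_a |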


## 3. Explore descriptive statistics

In [46]:
raw_data.describe(include='all')

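| | Na | Al | K | Ca | Ba | Fe | Product |
|---|---|---|---|---|---|---|---|
| count | 175.0 | 175.0 | 175.0 | 175.0 | 175.0 | 175.0 | 175 |
| unique | | | | | | | 3 |
| top | | | | | | | vit_b |
| freq | | | | | | | 76 |
| mean | 13.3844 | 1.428857 | 0.459143 | 8.866629 | 0.199257 | 0.059657 | |
| std | 0.769317 | 0.46206 | 0.338574 | 1.385399 | 0.521649 | 0.09331 | |
| min | 10.73 | 0.29 | 0.0 | 5.43 | 0.0 | 0.0 | |
| 25% | 12.885 | 1.185 | 0.145 | 8.21 | 0.0 | 0.0 | |
| 50% | 13.25 | 1.35 | 0.56 | 8.55 | 0.0 | 0.0 | |
| 75% | 13.79 | 1.615 | 0.61 | 9.025 | 0.0 | 0.11 | |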


## 4. Check for multicollinearity

In [47]:
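# Drop the text column 'Product' first; recent pandas versions raise an error on non-numeric columns
raw_data.drop(['Product'], axis=1).corr(method='pearson')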

| | Na | Al | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|
| Na | 1.0 | 0.339216 | -0.424251 | -0.272647 | 0.398665 | -0.298587 |
| Al | 0.339216 | 1.0 | -0.008492 | -0.286453 | 0.508297 | -0.092038 |
| K | -0.424251 | -0.008492 | 1.0 | -0.403987 | -0.137863 | 0.074467 |
| Ca | -0.272647 | -0.286453 | -0.403987 | 1.0 | -0.057533 | 0.165338 |
| Ba | 0.398665 | 0.508297 | -0.137863 | -0.057533 | 1.0 | -0.083682 |
| Fe | -0.298587 | -0.092038 | 0.074467 | 0.165338 | -0.083682 | 1.0 |


In [48]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Keep only the numeric features; the target column is not a predictor
variables = raw_data.drop(['Product'], axis=1)

# Compute one variance inflation factor (VIF) per feature column
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif["features"] = variables.columns

In [49]:
vif

| | VIF | features |
|---|---|---|
| 0 | 62.734376 | Na |
| 1 | 15.70426 | Al |
| 2 | 2.930039 | K |
| 3 | 38.015526 | Ca |
| 4 | 1.552807 | Ba |
| 5 | 1.517063 | Fe |
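
For reference, the VIF of feature j is usually defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The sketch below computes that definition directly with NumPy on synthetic data and always includes an intercept in the auxiliary regression. statsmodels' `variance_inflation_factor` appears to use the design matrix exactly as passed, without adding a constant column, so the values in the table above (computed from the raw feature columns) may be larger than intercept-based VIFs. The function name `vif_with_intercept` and the synthetic columns are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def vif_with_intercept(X):
    """Return 1 / (1 - R^2) per column, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression design: intercept column + all other features
        design = np.column_stack([np.ones(n_samples), others])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ beta
        ss_res = residuals @ residuals
        ss_tot = ((y - y.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic demo (hypothetical data): column c is nearly a copy of column a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
demo_vif = vif_with_intercept(np.column_stack([a, b, c]))
print(demo_vif)
```

In this demo the two nearly collinear columns get large VIFs while the independent column stays close to 1, which is the pattern used above to decide which feature to drop.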


In [54]:
# Na has the highest VIF, so drop it and re-check the remaining features
data = raw_data.drop(['Na'], axis=1)

In [55]:
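# Drop the text column 'Product' first; recent pandas versions raise an error on non-numeric columns
data.drop(['Product'], axis=1).corr(method='pearson')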

| | Al | K | Ca | Ba | Fe |
|---|---|---|---|---|---|
| Al | 1.0 | -0.008492 | -0.286453 | 0.508297 | -0.092038 |
| K | -0.008492 | 1.0 | -0.403987 | -0.137863 | 0.074467 |
| Ca | -0.286453 | -0.403987 | 1.0 | -0.057533 | 0.165338 |
| Ba | 0.508297 | -0.137863 | -0.057533 | 1.0 | -0.083682 |
| Fe | -0.092038 | 0.074467 | 0.165338 | -0.083682 | 1.0 |


In [56]:
# variance_inflation_factor was already imported above; recompute the VIFs without Na
variables = data.drop(['Product'], axis=1)

vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif["features"] = variables.columns

In [57]:
vif

| | VIF | features |
|---|---|---|
| 0 | 10.310297 | Al |
| 1 | 2.700353 | K |
| 2 | 8.652152 | Ca |
| 3 | 1.546809 | Ba |
| 4 | 1.474602 | Fe |


## 5. Logistic regression

### 5.1 Declare inputs and targets

In [58]:
targets = data['Product']
inputs = data.drop(['Product'], axis=1)

### 5.2 Perform feature scaling

In [59]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(inputs)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [60]:
inputs_scaled = scaler.transform(inputs)

### 5.3 Divide the dataset into training and testing subsets

In [61]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets, test_size=0.2, random_state=10)
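
Note that the scaler above is fit on all rows before the split, so the test rows influence the mean and standard deviation applied to the training data. A common alternative is to split first and compute the scaling statistics from the training rows only. The sketch below shows the idea with plain NumPy on synthetic data; the array names and sizes are hypothetical and not taken from the notebook.

```python
import numpy as np

# Synthetic stand-in for the feature matrix (hypothetical data)
rng = np.random.default_rng(10)
features = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Shuffle the row indices and split 80/20 before any scaling
indices = rng.permutation(len(features))
train_idx, test_idx = indices[:80], indices[80:]
train_raw, test_raw = features[train_idx], features[test_idx]

# Scaling statistics come from the training rows only
train_mean = train_raw.mean(axis=0)
train_std = train_raw.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same training statistics to both subsets
train_scaled = (train_raw - train_mean) / train_std
test_scaled = (test_raw - train_mean) / train_std
```

With scikit-learn, the same effect comes from calling `scaler.fit(...)` on the training subset only and then `scaler.transform(...)` on both subsets.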

### 5.4 Build regression model

In [83]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(C=10, solver='lbfgs')
log_reg.fit(x_train,y_train)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [84]:
# Coefficients
log_reg.coef_

array([[-2.72071616, -0.28608463, -1.49079017, -0.596777  , -0.39240852],
       [ 1.46099882, -0.07487252,  0.70631819, -5.66082347,  0.46419863],
       [ 2.25487843,  1.08521868,  1.36532489,  3.43313585, -1.63534777]])

In [85]:
# Intercepts
log_reg.intercept_

array([-1.19475323, -1.6234654 , -4.05427473])

In [86]:
# Classes
log_reg.classes_

array(['vit_a', 'vit_b', 'vit_c'], dtype=object)

- Why are there 3 rows and 5 columns in the coefficient array?
- Why are there 3 intercept values?

Hint: Look at the logit functional form. What does it represent?
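
One way to make the hint concrete: the fitted model printed above reports `multi_class='ovr'`, so each row of `coef_`, together with one intercept, defines a separate binary logit for one class against the rest. The sketch below reuses the printed coefficients and intercepts and scores a hypothetical scaled sample (all zeros, meaning every concentration sits at its mean). It applies the sigmoid to each logit and normalizes the results, which mirrors how a one-vs-rest model is commonly evaluated and is not claimed to be scikit-learn's exact internals.

```python
import numpy as np

classes = np.array(['vit_a', 'vit_b', 'vit_c'])

# Values copied from the log_reg.coef_ and log_reg.intercept_ output above
coef = np.array([
    [-2.72071616, -0.28608463, -1.49079017, -0.596777, -0.39240852],
    [1.46099882, -0.07487252, 0.70631819, -5.66082347, 0.46419863],
    [2.25487843, 1.08521868, 1.36532489, 3.43313585, -1.63534777],
])
intercept = np.array([-1.19475323, -1.6234654, -4.05427473])

# Hypothetical scaled sample: Al, K, Ca, Ba, Fe all at their mean (zero after scaling)
sample = np.zeros(5)

# One logit per class: w_k . x + b_k
logits = coef @ sample + intercept

# Sigmoid gives a "class k vs. rest" probability; normalize so the three sum to 1
per_class = 1.0 / (1.0 + np.exp(-logits))
probabilities = per_class / per_class.sum()

predicted = classes[np.argmax(logits)]
print(predicted, probabilities.round(3))
```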

### 5.5 Calculate the accuracy of the model on the testing dataset using the confusion matrix

In [87]:
y_test_pred = log_reg.predict(x_test)
y_test_pred

array(['vit_a', 'vit_b', 'vit_b', 'vit_a', 'vit_b', 'vit_b', 'vit_c',
       'vit_a', 'vit_a', 'vit_b', 'vit_b', 'vit_c', 'vit_c', 'vit_a',
       'vit_c', 'vit_a', 'vit_a', 'vit_b', 'vit_c', 'vit_c', 'vit_a',
       'vit_b', 'vit_a', 'vit_b', 'vit_a', 'vit_a', 'vit_a', 'vit_b',
       'vit_a', 'vit_a', 'vit_a', 'vit_b', 'vit_a', 'vit_c', 'vit_c'], dtype=object)

In [88]:
from sklearn.metrics import confusion_matrix

# Store the result under a different name so the imported function is not overwritten
cm = confusion_matrix(y_test, y_test_pred)

print(cm)

[[11  4  0]
 [ 5  7  1]
 [ 0  0  7]]


In [89]:
# Accuracy (%) = correctly classified samples (diagonal) / all samples
accuracy = cm.trace() / cm.sum() * 100
accuracy

71.428571428571431
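
Overall accuracy hides how the errors are distributed across classes. As a small sketch, the per-class precision and recall can be read off the same confusion matrix (rows are true labels, columns are predicted labels, following scikit-learn's convention). The matrix values are copied from the output above; the variable names are illustrative.

```python
import numpy as np

classes = ['vit_a', 'vit_b', 'vit_c']

# Confusion matrix copied from the printed output above
cm = np.array([[11, 4, 0],
               [5, 7, 1],
               [0, 0, 7]])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)      # share of each true class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that was right
overall_accuracy = correct.sum() / cm.sum()

for name, p, r in zip(classes, precision, recall):
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
print(f"accuracy={overall_accuracy:.3f}")
```

Here all misclassifications except one involve vit_a and vit_b being confused with each other, while every vit_c sample in the test set is classified correctly.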

## Additional tasks:

- Try removing Ca instead of Na to check if accuracy improves or not.
- Your company has now given you additional features. Use them to see if accuracy increases. For this, use the file "vitamin_product_additional.xlsx".
- Try using lower values of C in the LogisticRegression class. C is the inverse of the regularization strength, so a lower value means stronger regularization, which can reduce overfitting.
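
A minimal sketch of the last task, assuming scikit-learn is available: it compares a few values of C using 5-fold cross-validation instead of a single train/test split, and wraps the scaler and the model in a pipeline so that scaling is refit inside each fold. The data comes from `make_classification` as a synthetic stand-in (hypothetical, not the vitamin dataset), and the list of C values is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 5 features (stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=175, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=10,
)

cv_scores = {}
for c_value in [0.01, 0.1, 1, 10]:
    # Smaller C means stronger L2 regularization
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=c_value, solver='lbfgs', max_iter=1000),
    )
    cv_scores[c_value] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for c_value, score in cv_scores.items():
    print(f"C={c_value}: mean CV accuracy = {score:.3f}")
```

To apply this to the exercise, replace `X_demo` and `y_demo` with the unscaled `inputs` and `targets` from section 5.1.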