# Week 6: Multiple and Logistic Regression
Our goal this week is twofold:

1. To look at one target (dependent) variable in relation to multiple input variables. We will use multiple regression for this purpose.
2. To look at a binary target variable, such as creditworthy/not creditworthy or tumor benign/tumor malignant, in relation to various independent variables. We will use logistic regression for that.

## Dataset
For this purpose, we will use the insurance prediction dataset posted at http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/insurance.csv

The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns): 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region). The nominal features are read in as strings and are converted to numerical codes, one per level, before modeling.

The purpose of this exercise is to examine how the features relate to one another and to fit a multiple linear regression of existing medical expenses on individual characteristics such as age, physical and family condition, and location. Such a model can be used to predict an individual's future medical expenses, which helps a medical insurer decide what premium to charge.

In [2]:
import numpy as np
import pandas as pd

dataset = pd.read_csv('http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/insurance.csv')
print('Dataset Shape', dataset.shape)
dataset.head()

Dataset Shape (1338, 7)


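| | age | sex | bmi | children | smoker | region | expenses |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.92 |
| 1 | 18 | male | 33.8 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.7 | 0 | no | northwest | 21984.47 |
| 4 | 32 | male | 28.9 | 0 | no | northwest | 3866.86 |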


In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
age         1338 non-null int64
sex         1338 non-null object
bmi         1338 non-null float64
children    1338 non-null int64
smoker      1338 non-null object
region      1338 non-null object
expenses    1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB


## 1. Multiple Regression
Let's set up our model. The feature matrix `X` holds the first six columns and the target `y` holds `expenses`.
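As a reminder of what is being fitted, a multiple regression model with $p$ input variables can be written as below. The symbols ($\beta_j$ for the coefficients, $\varepsilon$ for the error term) are standard textbook notation assumed here; they do not appear in the notebook itself.

```latex
\text{expenses} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Each $x_j$ is one column of `X` (after the categorical features have been encoded), and $\beta_0$ is the intercept.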

In [20]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 6].values

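# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 1] = labelencoder.fit_transform(X[:, 1])  # sex
X[:, 4] = labelencoder.fit_transform(X[:, 4])  # smoker
X[:, 5] = labelencoder.fit_transform(X[:, 5])  # region
# One-hot encode region (column 5). OneHotEncoder(categorical_features=[5]) was
# removed in scikit-learn 0.22; ColumnTransformer is the replacement.
# The four region dummy columns come first, followed by the remaining features.
ct = ColumnTransformer([('region', OneHotEncoder(), [5])],
                       remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)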

print('X', X.shape)
print('Y', y.shape)

X (1338, 9)
Y (1338,)


Note: recent versions of scikit-learn's `OneHotEncoder` accept string categories directly, so running a `LabelEncoder` on a column before one-hot encoding it is no longer necessary.
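An alternative way to encode the nominal features is `pandas.get_dummies`, which is a different technique from the encoder objects used in the notebook. The sketch below uses a few toy rows laid out like insurance.csv; the values and the names `toy` and `encoded` are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy rows in the same layout as insurance.csv (illustrative values only)
toy = pd.DataFrame({
    'age': [19, 18, 33, 40],
    'sex': ['female', 'male', 'male', 'female'],
    'bmi': [27.9, 33.8, 22.7, 30.1],
    'children': [0, 1, 0, 2],
    'smoker': ['yes', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'northwest', 'northeast'],
})

# One indicator column per category level. drop_first removes one level per
# feature so the indicators are not collinear with an intercept column.
encoded = pd.get_dummies(toy, columns=['sex', 'smoker', 'region'],
                         drop_first=True, dtype=float)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids the situation where the dummy columns of a feature sum to the intercept column, which makes the individual coefficients impossible to identify.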


In [21]:
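# Building the optimal model using Backward Elimination
import statsmodels.api as sm  # OLS is exposed by statsmodels.api in current releases

# Prepend a column of ones for the intercept. Column layout afterwards:
# 0 = const, 1-4 = region dummies, 5 = age, 6 = sex, 7 = bmi, 8 = children, 9 = smoker
X = np.append(arr = np.ones((X.shape[0], 1)), values = X, axis = 1)

# Step 1: start with columns 0-8 (note that column 9, smoker, is not among the candidates)
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
# Step 2: drop column 5 (age)
X_opt = X[:, [0, 1, 2, 3, 4, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
# Step 3: drop column 1 (first region dummy)
X_opt = X[:, [0, 2, 3, 4, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
# Step 4: drop column 3 (third region dummy)
X_opt = X[:, [0, 2, 4, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
# Step 5: drop column 2 (second region dummy); only this last summary is displayed
X_opt = X[:, [0, 4, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()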

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.048 |
| Model: | OLS | Adj. R-squared: | 0.045 |
| Method: | Least Squares | F-statistic: | 16.73 |
| Date: | Tue, 12 Feb 2019 | Prob (F-statistic): | 2.18e-13 |
| Time: | 22:35:56 | Log-Likelihood: | -14445.0 |
| No. Observations: | 1338 | AIC: | 28900.0 |
| Df Residuals: | 1333 | BIC: | 28930.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 377.9836 | 1710.273 | 0.221 | 0.825 | -2977.136 | 3733.104 |
| x1 | -1219.5762 | 754.673 | -1.616 | 0.106 | -2700.051 | 260.899 |
| x2 | 1136.1978 | 647.912 | 1.754 | 0.080 | -134.841 | 2407.236 |
| x3 | 387.8079 | 53.136 | 7.298 | 0.000 | 283.569 | 492.046 |
| x4 | 659.7022 | 268.614 | 2.456 | 0.014 | 132.750 | 1186.655 |

| | | | |
|---|---|---|---|
| Omnibus: | 261.546 | Durbin-Watson: | 1.996 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 432.726 |
| Skew: | 1.296 | Prob(JB): | 1.08e-94 |
| Kurtosis: | 4.022 | Cond. No. | 166.0 |


In [8]:
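# Features used for the scikit-learn model below:
# intercept column, three region dummies, sex, bmi and children (7 columns)
X_opt = X[:, [0, 2, 3, 4, 6, 7, 8]]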

### Setting up a training and a test set with scikit-learn
To see more about model_selection, check the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_opt, y, test_size = 0.2, random_state = 0)

In [13]:
print('X Train', X_train.shape)
print('X Test', X_test.shape)

X Train (1070, 7)
X Test (268, 7)
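The split sizes follow from `test_size = 0.2`: scikit-learn rounds the test share up, so 0.2 × 1338 = 267.6 becomes 268 test rows, leaving 1070 for training. A small sketch with placeholder arrays of the same shape (the names `X_demo` and `y_demo` are assumptions, and the zeros stand in for the real data) reproduces those numbers:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X_opt and y in the notebook
X_demo = np.zeros((1338, 7))
y_demo = np.zeros(1338)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)
```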


In [22]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

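# For a regressor, score() returns the coefficient of determination (R^2),
# not a classification accuracy; it is shown here as a percentage.
print('Training R^2 (%):', round(regressor.score(X_train,y_train)*100,2))

Training R^2 (%): 4.73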
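The number returned by `LinearRegression.score` is the coefficient of determination, R² = 1 − SS_res / SS_tot. As a hedged illustration on synthetic data (the arrays and coefficients below are made up), the value can be recomputed by hand and compared with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two features with known coefficients plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=100)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))
```

An R² close to 0 means the selected features explain very little of the variation in the target, whereas a value close to 1 means they explain most of it.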


In [24]:
y_pred = regressor.predict(X_test)

### Prediction

In [38]:
# Use the predictions stored in y_pred; y_test keeps the actual expenses
# so that predictions can still be compared against them.
charges_pred = pd.DataFrame({'Charges Prediction':y_pred}).round(2)
print(charges_pred.shape)
charges_pred.head(25)

(268, 1)


| | Charges Prediction |
|---|---|
| 0 | 11940.21 |
| 1 | 12967.99 |
| 2 | 17559.85 |
| 3 | 15132.05 |
| 4 | 7439.54 |
| 5 | 10398.09 |
| 6 | 11338.04 |
| 7 | 16219.6 |
| 8 | 14095.74 |
| 9 | 15142.63 |
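Predictions are most informative when placed next to the held-out actual values. The sketch below does this on synthetic data; the single made-up `age`-like feature, the coefficient 250 and the noise level are assumptions chosen only for illustration. It also reports the mean absolute error on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: one age-like feature and a noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(18, 64, size=(200, 1))
y_demo = 250 * X_demo[:, 0] + rng.normal(scale=1000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Actual and predicted values side by side, plus the per-row error
comparison = pd.DataFrame({'Actual': y_te, 'Predicted': y_hat}).round(2)
comparison['Error'] = (comparison['Predicted'] - comparison['Actual']).round(2)
mae = mean_absolute_error(y_te, y_hat)

print(comparison.head())
print('Mean absolute error:', round(mae, 2))
```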


## 2. Logistic Regression
This section uses a different, much smaller insurance dataset with a single predictor (`age`) and a binary target (`insurance status`). We set the dataframe up from scratch again first.
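With a single predictor, logistic regression models the probability of the positive class with the logistic (sigmoid) function. The notation below ($\beta_0$, $\beta_1$) is standard textbook notation assumed here rather than something defined in the notebook.

```latex
P(\text{insurance status} = 1 \mid \text{age})
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{age})}}
```

The model predicts class 1 when this probability exceeds a threshold (0.5 by default in scikit-learn's `predict`) and class 0 otherwise.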

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("https://raw.githubusercontent.com/gauravsb14/LogisticRegression/master/insurance.csv")
df.head()


| | age | insurance status |
|---|---|---|
| 0 | 20 | 0 |
| 1 | 25 | 0 |
| 2 | 47 | 1 |
| 3 | 52 | 0 |
| 4 | 46 | 1 |


In [18]:
df.shape

(27, 2)

In [11]:
X = df[['age']]
Y = df[['insurance status']]

train_x,test_x,train_y,test_y = train_test_split(X,Y,test_size = 0.1)
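Without a `random_state`, `train_test_split` shuffles differently on every run, so outputs from cells executed at different times may show different rows in the training and test sets. A minimal sketch with a made-up 27-row frame (the ages and labels in `toy` are illustrative assumptions, not the real data) shows how fixing the seed makes the split repeatable:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with the same shape (27 rows, 2 columns) as the dataset above
toy = pd.DataFrame({
    'age': range(18, 45),
    'insurance status': [0] * 13 + [1] * 14,
})
X_toy = toy[['age']]
y_toy = toy['insurance status']

# The same random_state yields the same split on every call
split_a = train_test_split(X_toy, y_toy, test_size=0.1, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.1, random_state=42)

train_rows, test_rows = split_a[0], split_a[1]
same_split = test_rows.index.equals(split_b[1].index)
print(len(train_rows), len(test_rows), same_split)
```

With only 27 rows, a 10% test set contains just three observations, so any accuracy figure computed on it is very coarse.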

In [40]:
train_x

| | age |
|---|---|
| 17 | 58 |
| 10 | 18 |
| 21 | 26 |
| 20 | 21 |
| 3 | 52 |
| 8 | 62 |
| 23 | 45 |
| 14 | 49 |
| 16 | 25 |
| 12 | 28 |


In [10]:
test_x

| | age |
|---|---|
| 19 | 18 |
| 21 | 26 |
| 10 | 18 |


In [41]:
model = LogisticRegression()

# Pass the target as a 1-D array; a one-column DataFrame triggers a DataConversionWarning
model.fit(train_x,train_y.values.ravel())


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [42]:
pred = model.predict(test_x)

print(pred)

[0 1 0]


In [43]:
print(test_y)

    insurance status
13                 0
24                 1
0                  0
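Beyond comparing predicted and actual labels by eye, the fitted model can report class probabilities and an accuracy score. The sketch below uses a small made-up set of ages and labels (an assumption for illustration, not the dataset above) with `predict_proba` and `accuracy_score`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up data: younger people uninsured (0), older people insured (1)
ages = np.array([[18], [22], [25], [28], [30], [45], [48], [52], [56], [60]])
insured = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(ages, insured)

# Column 1 of predict_proba is the estimated probability of class 1
proba = clf.predict_proba(np.array([[20], [58]]))[:, 1]
acc = accuracy_score(insured, clf.predict(ages))

print('P(insured) at age 20 and 58:', proba.round(3))
print('Training accuracy:', acc)
```

Probabilities are often more useful than hard labels on a dataset this small, because they show how confident the model is near the decision boundary.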
