# Binary Logistic Regression

#### What is Regression?
#### Regression is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent variables.
#### For example, the relationship between rash driving and the number of road accidents a driver has is best studied through regression.
## Logistic Regression
#### Logistic regression estimates the probability that an event succeeds or fails. Use it when the dependent variable is binary (0/1, True/False). The output is a probability between 0 and 1.
Main things to keep in mind:
1. It is used in classification problems.
2. It doesn't require a linear relationship between the dependent and independent variables; it assumes linearity between the independent variables and the log-odds of the outcome.
3. The independent variables should not be correlated with each other, i.e., there should be no multicollinearity. However, we can include interaction effects of categorical variables in the analysis and in the model.
4. If the values of the dependent variable are ordinal, it is called ordinal logistic regression.
5. If the dependent variable has more than two classes, it is known as multinomial logistic regression.
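
The mapping from a linear combination of the independent variables to a probability between 0 and 1 can be sketched with the logistic (sigmoid) function. This is a minimal illustration, not a fitted model: the helper names `sigmoid` and `predict_proba` and the coefficient values are assumptions made up for the example.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for a single observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration (not fitted).
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([1.0, 2.0], weights, bias)
label = 1 if p >= 0.5 else 0  # the usual 0.5 decision threshold
print(round(p, 3), label)
```

A score of zero maps to a probability of exactly 0.5, which is why 0.5 is the usual cut-off between the two classes.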

###### We will work on the Titanic dataset and go through it step by step, exploring the data and its variables.

In [36]:
# First, basic libraries for exploring the data
import pandas as pd
import numpy as np
# Second, the library for plotting graphs
import matplotlib.pyplot as plt
# Third, libraries for logistic regression
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
# Fourth, seaborn provides a high-level interface for drawing attractive and informative statistical graphics
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

#### Read the data from the CSV file so we can explore it

In [67]:
data = pd.read_csv("titanic3.csv")

In [68]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
pclass       1309 non-null float64
survived     1309 non-null float64
name         1309 non-null object
sex          1309 non-null object
age          1046 non-null float64
sibsp        1309 non-null float64
parch        1309 non-null float64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(7), object(7)
memory usage: 143.4+ KB


#### We can see that there are 14 columns and 1310 rows. No individual variable has all 1310 rows filled, so let us check how many values are missing in each.

In [69]:
data.isnull().sum()

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64

### Not every variable is fully populated. A few, such as cabin, boat, body and home.dest, have so many missing values that we do not have complete information for them. Let us see which gaps in this data we can fill.
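
One way to make "too many missing values" concrete is to look at the share of missing values per column rather than the raw counts. The sketch below uses a small made-up frame in place of the Titanic data, and the 50% cut-off is an assumption chosen for illustration, not a rule taken from this analysis.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the Titanic data.
toy = pd.DataFrame({
    "pclass": [1.0, 3.0, 2.0, 3.0],
    "cabin": ["C85", np.nan, np.nan, np.nan],
    "body": [np.nan, np.nan, np.nan, 135.0],
})

# Fraction of missing values per column (mean of the boolean mask).
missing_share = toy.isnull().mean()

# Hypothetical cut-off: treat columns with more than half missing as too sparse.
threshold = 0.5
sparse_cols = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(sparse_cols)
```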

In [70]:
data['pclass'].unique()

array([ 1.,  2.,  3., nan])

#### This variable takes only three distinct values, so we can treat it as categorical. In the same way we will classify the other variables as continuous or categorical. We will fill missing categorical values with the most frequent value and missing continuous values with the mean.
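
The filling strategy described here can be sketched without hard-coding the fill values. The frame below is made up for illustration; `mode()` and `mean()` are standard pandas methods, and taking the first mode when there are ties is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Made-up frame with one categorical and one continuous column.
toy = pd.DataFrame({
    "embarked": ["S", "C", "S", np.nan],
    "age": [20.0, 30.0, np.nan, 40.0],
})

# Categorical: fill with the most frequent value (first mode if tied).
most_common = toy["embarked"].mode()[0]
toy["embarked"] = toy["embarked"].fillna(most_common)

# Continuous: fill with the mean of the observed values.
toy["age"] = toy["age"].fillna(toy["age"].mean())

print(toy)
```

Computing the fill value from the column keeps the step correct if the data changes, whereas a literal such as `3.0` or `'S'` has to be re-checked by hand.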

In [71]:
data['pclass'].value_counts()

3.0    709
1.0    323
2.0    277
Name: pclass, dtype: int64

In [72]:
data['pclass'].fillna(3.0,inplace=True)

In [73]:
data['name'].unique()

array(['Allen, Miss. Elisabeth Walton', 'Allison, Master. Hudson Trevor',
       'Allison, Miss. Helen Loraine', ..., 'Zakarian, Mr. Ortin',
       'Zimmerman, Mr. Leo', nan], dtype=object)

There are a lot of unique values, so we cannot use this variable to summarize the data.

In [74]:
data['sex'].value_counts()

male      843
female    466
Name: sex, dtype: int64

In [75]:
data['sex'].fillna('male',inplace=True)

In [76]:
data['age'].fillna(data['age'].mean(),inplace=True)

In [77]:
data['sibsp'].value_counts()

0.0    891
1.0    319
2.0     42
4.0     22
3.0     20
8.0      9
5.0      6
Name: sibsp, dtype: int64

In [78]:
data['sibsp'].fillna(0.0,inplace=True)

In [79]:
data['parch'].value_counts()

0.0    1002
1.0     170
2.0     113
3.0       8
5.0       6
4.0       6
9.0       2
6.0       2
Name: parch, dtype: int64

In [80]:
data['parch'].fillna(0.0,inplace=True)

In [10]:
data['fare'].mean()

33.29547928134572

In [81]:
data['fare'].fillna(data['fare'].mean(),inplace=True)

### We can see that the cabin data does not help much in generalizing the data

In [82]:
data['embarked'].value_counts()

S    914
C    270
Q    123
Name: embarked, dtype: int64

In [83]:
data['embarked'].fillna('S',inplace=True)

In [15]:
data['boat'].isnull().sum()

824

In [84]:
data['age'].unique()

array([29.        ,  0.9167    ,  2.        , 30.        , 25.        ,
       48.        , 63.        , 39.        , 53.        , 71.        ,
       47.        , 18.        , 24.        , 26.        , 80.        ,
       29.88113451, 50.        , 32.        , 36.        , 37.        ,
       42.        , 19.        , 35.        , 28.        , 45.        ,
       40.        , 58.        , 22.        , 41.        , 44.        ,
       59.        , 60.        , 33.        , 17.        , 11.        ,
       14.        , 49.        , 76.        , 46.        , 27.        ,
       64.        , 55.        , 70.        , 38.        , 51.        ,
       31.        ,  4.        , 54.        , 23.        , 43.        ,
       52.        , 16.        , 32.5       , 21.        , 15.        ,
       65.        , 28.5       , 45.5       , 56.        , 13.        ,
       61.        , 34.        ,  6.        , 57.        , 62.        ,
       67.        ,  1.        , 12.        , 20.        ,  0.83

#### This information seems too sparse to generalize the data.

In [85]:
data['body'].unique()

array([ nan, 135.,  22., 124., 148., 208., 172., 269.,  62., 133., 275.,
       147., 110., 307.,  38.,  80.,  45., 258., 126., 292., 175., 249.,
       230., 122., 263., 234., 189., 166., 207., 232.,  16., 109.,  96.,
        46., 245., 169., 174.,  97.,  18., 130.,  17., 295., 286., 236.,
       322., 297., 155., 305.,  19.,  75.,  35., 256., 149., 283., 165.,
       108., 121.,  52., 209., 271.,  43.,  15., 101., 287.,  81., 294.,
       293., 190.,  72., 103.,  79., 259., 260., 142., 299., 171.,   9.,
       197.,  51., 187.,  68.,  47.,  98., 188.,  69., 306., 120., 143.,
       156., 285.,  37.,  58.,  70., 196., 153.,  61.,  53., 201., 309.,
       181., 173.,  89.,   4., 206., 327., 119.,   7.,  32.,  67., 284.,
       261., 176.,  50.,   1., 255., 298., 314.,  14., 131., 312., 328.,
       304.])

In [19]:
data['body'].isnull().sum()

1189

#### I could have used body for generalizing the data, but the information is very sparse: only 121 of the 1310 values are filled, so I will leave this variable out.

In [20]:
data['home.dest'].isnull().sum()

565

In [51]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
pclass       1310 non-null float64
survived     1309 non-null float64
name         1309 non-null object
sex          1310 non-null object
age          1310 non-null object
sibsp        1310 non-null float64
parch        1310 non-null float64
ticket       1309 non-null object
fare         1310 non-null float64
cabin        295 non-null object
embarked     1310 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(6), object(8)
memory usage: 143.4+ KB


In [86]:
data['survived'].fillna(1.0,inplace=True)
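
Filling a missing target with 1.0 labels that row as a survivor. Since every column reports at least one missing value out of 1310 rows, the file appears to contain one empty row, and an alternative is to drop rows whose target is missing rather than assign them a label. The sketch below shows that alternative on a made-up frame; it is not what this notebook does, and applying it here would change the row counts shown in the outputs that follow.

```python
import numpy as np
import pandas as pd

# Made-up frame: the last row has no target value.
toy = pd.DataFrame({
    "survived": [1.0, 0.0, np.nan],
    "age": [29.0, np.nan, np.nan],
})

# Drop rows where the target is missing instead of guessing a label.
cleaned = toy.dropna(subset=["survived"]).reset_index(drop=True)

print(len(toy), len(cleaned))
```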

##### The information available in home.dest would not be useful.

##### So we will use pclass, survived, age, sex, sibsp, parch, fare and embarked in our further analysis.

In [87]:
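# Drop name, ticket, cabin, boat, body and home.dest (column positions 2, 7, 9, 11, 12, 13)
data.drop(columns=['name', 'ticket', 'cabin', 'boat', 'body', 'home.dest'], inplace=True)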

In [88]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 8 columns):
pclass      1310 non-null float64
survived    1310 non-null float64
sex         1310 non-null object
age         1310 non-null float64
sibsp       1310 non-null float64
parch       1310 non-null float64
fare        1310 non-null float64
embarked    1310 non-null object
dtypes: float64(6), object(2)
memory usage: 82.0+ KB


In [89]:
data2 = pd.get_dummies(data, columns =["sex","pclass","sibsp","parch","embarked"])
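
Encoding every level of a categorical variable produces columns that are perfectly collinear: for a two-level variable such as sex, one dummy is exactly one minus the other. A common way to avoid this is the `drop_first` parameter of `pd.get_dummies`, sketched below on a made-up frame. This is an optional variation, not the encoding used in this notebook.

```python
import pandas as pd

# Made-up frame with one categorical and one continuous column.
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "fare": [7.25, 71.28, 8.05],
})

# One dummy per level: sex_female and sex_male carry the same information.
full = pd.get_dummies(toy, columns=["sex"])

# drop_first=True drops the first level of each variable, leaving k-1 dummies.
reduced = pd.get_dummies(toy, columns=["sex"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```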

In [90]:
data2.corr()

Unnamed: 0,survived,age,fare,sex_female,sex_male,pclass_1.0,pclass_2.0,pclass_3.0,sibsp_0.0,sibsp_1.0,...,parch_1.0,parch_2.0,parch_3.0,parch_4.0,parch_5.0,parch_6.0,parch_9.0,embarked_C,embarked_Q,embarked_S
survived,1.0,-0.05016801,0.244057,0.527534,-0.527534,0.278686,0.050251,-0.282269,-0.104933,0.150051,...,0.163523,0.07712,0.039123,-0.030117,-0.030117,-0.030772,-0.030772,0.181498,-0.016373,-0.149789
age,-0.050168,1.0,0.171521,-0.057386,0.057386,0.362541,-0.014192,-0.301996,0.090684,0.045528,...,-0.138778,-0.223422,0.054764,0.077804,0.049806,0.035293,-1.383376e-17,0.076171,-0.012718,-0.059143
fare,0.244057,0.1715206,1.0,0.185445,-0.185445,0.599881,-0.12136,-0.419481,-0.211728,0.169176,...,0.125358,0.166706,0.080954,0.093718,-0.001232,0.01029,0.0274213,0.286212,-0.130049,-0.169866
sex_female,0.527534,-0.05738563,0.185445,1.0,-1.0,0.107659,0.029147,-0.117022,-0.189198,0.172842,...,0.130612,0.163591,0.064557,0.044058,0.044058,0.011784,0.01178443,0.066832,0.088812,-0.115522
sex_male,-0.527534,0.05738563,-0.185445,-1.0,1.0,-0.107659,-0.029147,0.117022,0.189198,-0.172842,...,-0.130612,-0.163591,-0.064557,-0.044058,-0.044058,-0.011784,-0.01178443,-0.066832,-0.088812,0.115522
pclass_1.0,0.278686,0.3625414,0.599881,0.107659,-0.107659,1.0,-0.296232,-0.622295,-0.083348,0.141727,...,0.042605,-0.005437,0.000625,0.013656,-0.038804,-0.022369,-0.02236937,0.325871,-0.165933,-0.182034
pclass_2.0,0.050251,-0.01419151,-0.12136,0.029147,-0.029147,-0.296232,1.0,-0.563305,-0.026525,0.063363,...,0.039238,0.007365,0.031396,-0.035126,-0.035126,-0.020249,-0.02024887,-0.134447,-0.121828,0.196221
pclass_3.0,-0.282269,-0.3019955,-0.419481,-0.117022,0.117022,-0.622295,-0.563305,1.0,0.093842,-0.174535,...,-0.069015,-0.001333,-0.026271,0.016975,0.062357,0.035947,0.03594658,-0.171716,0.243392,-0.003343
sibsp_0.0,-0.104933,0.0906837,-0.211728,-0.189198,0.189198,-0.083348,-0.026525,0.093842,1.0,-0.828806,...,-0.310701,-0.198003,-0.072466,-0.050582,-0.050582,-0.057122,-0.05712226,-0.052013,0.091223,-0.012151
sibsp_1.0,0.150051,0.04552753,0.169176,0.172842,-0.172842,0.141727,0.063363,-0.174535,-0.828806,1.0,...,0.257232,0.009396,0.069672,0.066877,0.066877,0.068921,0.06892117,0.106632,-0.097269,-0.032213


In [91]:
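# survived is the first column of data2; select it by name so the split does not depend on column order
X = data2.drop(columns='survived')
y = data2['survived']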

In [92]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [93]:
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [94]:
y_pred = classifier.predict(X_test)

In [95]:
from sklearn.metrics import confusion_matrix
# Use a separate name so the imported function is not overwritten by its own result
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[184  24]
 [ 38  82]]


In [96]:
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.81
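
The reported accuracy can be checked by hand against the confusion matrix printed earlier, which also yields precision and recall for the survivor class. The sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with class 0 first.

```python
# Counts copied from the confusion matrix output above.
tn, fp = 184, 24   # true class 0: predicted 0, predicted 1
fn, tp = 38, 82    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all test passengers classified correctly
precision = tp / (tp + fp)     # of predicted survivors, how many actually survived
recall = tp / (tp + fn)        # of actual survivors, how many the model found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Accuracy alone hides the fact that the model misses more survivors (38) than it wrongly predicts (24), which is what the lower recall reflects.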
