## Logistic Regression

In the real world, we often face decisions whose outcomes are finite, as in the examples below:

* Will it rain today?
* Will I reach the office on time today?
* Will a student graduate from his/her university?
* Does a sedentary lifestyle increase the chance of getting heart disease?
* Does smoking lead to lung cancer?
* Should I wear a blue, black, or red outfit today?
* What grade will a student get in an exam?

All of the above situations reflect input-output relationships. Here the output variable takes discrete, finite values rather than continuous values as in Linear Regression. How could we model and analyze such data?
    
We could try to frame a rule that helps in guessing the outcome from the input variables. This is called a classification problem, and it is an important topic in statistics and machine learning. Classification, the task of assigning objects to one of several predefined categories, is a pervasive problem that spans many applications across a broad range of domains. Some examples of classification tasks are listed below:

* In the medical field, the task could be assigning a diagnosis to a patient based on observed characteristics such as age, gender, blood pressure, body mass index, and the presence or absence of certain symptoms.
* In the banking sector, one may want to categorize hundreds or thousands of applications for new cards, described by attributes such as annual salary, outstanding debts, and age, into applicants with good credit or bad credit so that a credit card company can analyze them further. One might also want to predict whether a particular credit card charge is legitimate or fraudulent.
* In the social sciences, we may want to predict a voter's preference for a party based on age, income, sex, race, state of residence, votes in previous elections, etc.
* In the finance sector, one may need to ascertain whether a vendor is creditworthy.
* In the insurance domain, the company needs to assess whether a submitted claim is fraudulent or genuine.
* In marketing, the marketer would like to figure out which segment of consumers is likely to buy.

In the business world, classification problems, where the response (dependent) variable has discrete and finite outcomes, are often more prevalent than regression problems, where the response variable is continuous. Logistic Regression is one of the most common algorithms used for modeling classification problems.  
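
To make the link between a continuous model and a discrete outcome concrete, here is a minimal sketch of the logistic (sigmoid) function that logistic regression applies to a linear score. The coefficient values and the 0.5 cut-off are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a single input feature (illustration only)
intercept, slope = -4.0, 0.8
x = np.array([0.0, 5.0, 10.0])

# Linear score -> probability of the positive class
prob = sigmoid(intercept + slope * x)

# Discrete outcome obtained with an assumed 0.5 cut-off
label = (prob >= 0.5).astype(int)
print(prob.round(3), label)
```

The linear part can take any real value, while the sigmoid squeezes it into a probability that is then turned into a class label.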

* Logistic regression comes under the category of supervised machine learning.
* First we identify the boundary conditions (the decision boundary), and then the task is to predict the target class.
* Example: classifying gender (the target class) using hair length as the feature.
* The idea is to predict the target class by analyzing the training dataset.
* Train your model using any available classification algorithm.
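
The hair-length example in the list above could be sketched as follows. The data are made up for illustration, and scikit-learn's `LogisticRegression` is used here as one possible classifier; it is not the library used for the Titanic model later in this section.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: hair length in cm and a label (0 = male, 1 = female)
hair_cm = np.array([2, 3, 5, 6, 8, 25, 30, 35, 40, 45], dtype=float).reshape(-1, 1)
gender = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Learn the boundary between the two classes from the training data
clf = LogisticRegression()
clf.fit(hair_cm, gender)

# Predict the target class for two new, unseen hair lengths
new_lengths = np.array([[4.0], [38.0]])
predicted = clf.predict(new_lengths)
print(predicted)
```

A single feature like this is rarely enough in practice; the sketch only shows the train-then-predict pattern.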

<font color="red">**Step 1:**</font> Load data and run numerical and graphical summaries

<font color="red">**Step 2:**</font> Split the data into training data and test data

<font color="red">**Step 3:**</font> Fit a model using training data

<font color="red">**Step 4:**</font> Use the fitted model to make predictions for the test data

<font color="red">**Step 5:**</font> Create a confusion matrix, and compute the misclassification rate
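
The workflow in the steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data generated with scikit-learn; the sample size, split ratio, and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Load data (synthetic here) and take a quick numerical summary
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
print("class counts:", np.bincount(y))

# Split the data into training data and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a model using the training data only
model = LogisticRegression().fit(X_train, y_train)

# Use the fitted model to predict the held-out test data
y_pred = model.predict(X_test)

# Confusion matrix (rows = actual, columns = predicted) and misclassification rate
cm = confusion_matrix(y_test, y_pred)
misclassification_rate = 1 - np.trace(cm) / cm.sum()
print(cm)
print("misclassification rate:", round(misclassification_rate, 3))
```

Evaluating on rows the model has not seen gives a more honest estimate of its error than evaluating on the training rows.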

![Imgur](https://i.imgur.com/iGNsDyl.jpg)    
<sub>source: <a href="https://www.biomedware.com/files/documentation/spacestat/Statistics/Multivariate_Modeling/Regression/Implementation_of_Logistic_GWR.htm" target="_blank">https://www.biomedware.com/files/documentation/spacestat/Statistics/Multivariate_Modeling/Regression/Implementation_of_Logistic_GWR.htm</a></sub>  

![Imgur](https://i.imgur.com/WayfUZH.png)    
<sub>source: <a href="https://www.saedsayad.com/logistic_regression.htm" target="_blank">https://www.saedsayad.com/logistic_regression.htm</a></sub>  

### Load the packages

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
 
pd.set_option("display.max_rows",10)
pd.set_option("display.max_columns",101)
 
%matplotlib inline

# For the evaluation (import only what is used instead of a wildcard import)
from sklearn.metrics import confusion_matrix



### Load the data

In [2]:
df = pd.read_csv('data/Titanic.csv')
df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


### Data Cleaning, Manipulation and Exploration

We are going to drop ‘PassengerId’, ‘Name’ and ‘Ticket’, since we do not expect these variables to influence the chance of survival.

In [3]:
df = df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)

In [4]:
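# Flag whether a cabin was recorded (1) or missing (0), then drop the raw column
df['HasCabin'] = df.Cabin.notnull().astype(int)
df.HasCabin.value_counts()
df = df.drop(['Cabin'], axis=1)

# One-hot encode the categorical variables. With prefix='Embarked_' and the
# default separator '_', the new columns are named 'Embarked__C', 'Embarked__Q', ...
# dtype=int keeps the dummies numeric (0/1) so that statsmodels can use them.
embarked = pd.get_dummies(df.Embarked, prefix='Embarked_', dtype=int)
sex = pd.get_dummies(df.Sex, prefix='Sex_', dtype=int)
df = df.join(embarked)
df = df.join(sex)

# Drop the original categorical columns and any rows with missing values
vars_to_drop = ['Sex', 'Embarked']
df = df.drop(vars_to_drop, axis=1)
df = df.dropna(axis=0)
df

# Save the cleaned data for the model building step
df.to_csv('data/Titanic_Clean.csv', index=False)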
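
The dummy-variable step produces column names with a double underscore, because `pd.get_dummies` joins the prefix and the category with `_` by default. A minimal sketch on a made-up three-row frame (the `toy` data and `dtype=int` are assumptions for illustration):

```python
import pandas as pd

# Made-up data standing in for the Sex column of the Titanic frame
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})

# prefix "Sex_" + default separator "_" + category name -> "Sex__female", "Sex__male"
dummies = pd.get_dummies(toy.Sex, prefix="Sex_", dtype=int)
print(list(dummies.columns))
print(dummies)
```

Each row has exactly one 1 across the dummy columns of a variable, so the columns always sum to 1.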

### Model Building

In [33]:
df = pd.read_csv('data/Titanic_Clean.csv')

# Drop one dummy column per categorical variable so that the remaining
# dummies are not perfectly collinear with the intercept
vars_to_drop = ['Sex__male', 'Embarked__C']
df = df.drop(vars_to_drop, axis=1)

# sm.Logit does not add an intercept automatically, so add a constant column
df['_intercept'] = 1

# Predictors: every column except the target
x = df.drop('Survived', axis=1)

# Target: keep Survived as a one-column DataFrame rather than a Series
y = pd.DataFrame(df.Survived)

# Make the model
logit = sm.Logit(y, x)

# Fit the model
result = logit.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.441075
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      704
Method:                           MLE   Df Model:                            9
Date:                Wed, 25 Jul 2018   Pseudo R-squ.:                  0.3470
Time:                        12:39:22   Log-Likelihood:                -314.93
converged:                       True   LL-Null:                       -482.26
                                        LLR p-value:                 1.136e-66
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Pclass         -1.0132      0.197     -5.136      0.000      -1.400      -0.627
Age            -0.0435    
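
The `coef` column of a logit summary is on the log-odds scale, so exponentiating a coefficient gives an odds ratio. A small sketch using the Pclass coefficient copied from the output above (the variable names are illustrative):

```python
import numpy as np

# Pclass coefficient copied from the summary output (log-odds scale)
coef_pclass = -1.0132

# Odds ratio: multiplicative change in the odds of survival
# for a one-unit increase in Pclass, other variables held fixed
odds_ratio = np.exp(coef_pclass)
percent_change = (odds_ratio - 1) * 100
print(round(odds_ratio, 3), round(percent_change, 1))
```

Under this model, moving one class down (for example from 1st to 2nd) multiplies the odds of survival by roughly 0.36. With the fitted model at hand, `np.exp(result.params)` would give the odds ratios for all variables at once.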

### Model Evaluation

In [34]:
pred = result.predict(x)
pred

0      0.092604
1      0.936340
2      0.636565
3      0.918227
4      0.077030
         ...   
709    0.315290
710    0.246331
711    0.969483
712    0.708212
713    0.057275
Length: 714, dtype: float64
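
The predictions are probabilities of survival, so a cut-off is needed to turn them into class labels. A minimal sketch using the first five values from the output above and an assumed cut-off of 0.5:

```python
import numpy as np

# First five predicted probabilities copied from the output above
pred_head = np.array([0.092604, 0.936340, 0.636565, 0.918227, 0.077030])

# Assumed cut-off: predict "survived" when the probability is at least 0.5
threshold = 0.5
labels = (pred_head >= threshold).astype(int)
print(labels)  # [0 1 1 1 0]
```

Rounding with `np.round(pred, 0)` gives the same labels except for a probability of exactly 0.5, which NumPy rounds to 0; an explicit comparison makes the cut-off visible and easy to change.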

In [35]:
confusion_matrix(y, np.round(pred,0))

array([[363,  61],
       [ 80, 210]])

**Summary:**  
    
* 210 people who were predicted to survive actually survived (Correct classification)
* 363 people who were predicted to not survive actually didn’t survive (Correct classification)
* 61 people were predicted to survive but actually died (False Positive / Type 1 error)
* 80 people were predicted to die but actually survived (False Negative / Type 2 error)
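
The four counts can be turned into summary rates, including the misclassification rate. A short sketch using the confusion matrix values shown above (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[363, 61],
               [80, 210]])
tn, fp, fn, tp = cm.ravel()

total = cm.sum()
accuracy = (tp + tn) / total                # share of correct predictions
misclassification_rate = (fp + fn) / total  # share of wrong predictions
precision = tp / (tp + fp)                  # predicted survivors who survived
recall = tp / (tp + fn)                     # actual survivors who were found
print(round(accuracy, 4), round(misclassification_rate, 4))
print(round(precision, 4), round(recall, 4))
```

About 80% of passengers are classified correctly. Note that these predictions were made on the same rows used to fit the model, so the error on unseen data may be higher.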

## Decision Tree and Random Forest

https://www.datacamp.com/community/news/how-decision-tree-model-is-different-from-random-forest-d9dyn7jqv4w  
https://engmrk.com/module-13-decision-tree-and-random-forest-2/?utm_campaign=News&utm_medium=Community&utm_source=DataCamp.com

## Support Vector Machines

**muffin_vs_cupcake_demo.pdf:**    
https://notebooks.azure.com/sumendar/libraries/StatsDSMLwithPython-june18/html/10-scikitLearn/Classification/docs/  

<span style="color:red; font-family:Comic Sans MS">Sources & References: </span>     
<a href="https://clevertap.com/blog/a-primer-on-logistic-regression-part-i/" target="_blank">https://clevertap.com/blog/a-primer-on-logistic-regression-part-i/</a>  
<a href="http://stamfordresearch.com/titanic-2-logistic-regression/" target="_blank">http://stamfordresearch.com/titanic-2-logistic-regression/</a>  
<a href="https://github.com/adashofdata/muffin-cupcake" target="_blank">https://github.com/adashofdata/muffin-cupcake</a>  
