# Background

__The veteran dataset contains 44 variables that describe the characteristics of givers and non-givers in a charity campaign. The label that indicates a donation is TARGET_B.__


- TARGET_B = 1: donation
- TARGET_B = 0: no donation

__Below is an example of how to apply a logistic regression model to the data in order to predict the likelihood of a donation. The model is evaluated with a so-called ROC curve.__


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

from pandas import Series, DataFrame
import matplotlib.pyplot as plt

from sklearn.metrics import roc_curve, auc

from sklearn.metrics import confusion_matrix


__Import the data into a DataFrame and look at the first 10 observations.__

In [None]:
df = pd.read_csv("veteran_brutto.csv" )

df.head(10)

In [None]:
df.info()

__Descriptive statistics__

In [None]:
m_cts = (df['URBANICITY'].value_counts())

# Define the overall figure area
fig = plt.figure(figsize = (10,5))
# Create the axes object that is placed inside the figure above
ax = fig.add_subplot(1,1,1)

ax.set_title('URBANICITY')


m_cts.plot(kind='bar', rot = 0, grid = False, alpha = 0.6)

plt.show()

In [None]:
m_cts = (df['SES'].value_counts())

# Define the overall figure area
fig = plt.figure(figsize = (10,5))
# Create the axes object that is placed inside the figure above
ax = fig.add_subplot(1,1,1)

ax.set_title('SES')


m_cts.plot(kind='bar', rot = 0, grid = False, alpha = 0.6)

plt.show()

In [None]:
m_cts = (df['DONOR_GENDER'].value_counts())

# Define the overall figure area
fig = plt.figure(figsize = (10,5))
# Create the axes object that is placed inside the figure above
ax = fig.add_subplot(1,1,1)

ax.set_title('GENDER')


m_cts.plot(kind='bar', rot = 0, grid = False, alpha = 0.6)

plt.show()

In [None]:
### Keep only observations where the donor is recorded as male or female

df_subset = df[(df['DONOR_GENDER'] == 'F') | (df['DONOR_GENDER'] == 'M')].copy()

In [None]:
len(df)

In [None]:
len(df_subset)

In [None]:
# Define the overall figure area
fig = plt.figure(figsize = (10,5))
# Create the axes object that is placed inside the figure above
ax = fig.add_subplot(1,1,1)

ax.set_title('Age')


# Drop missing ages first so the histogram only sees observed values
plt.hist(df_subset['DONOR_AGE'].dropna(), bins = 20, alpha = 0.6)

plt.show()

### Now we want to build a classification model that predicts the probability of a donation. We want a dynamic workflow that handles all the different data types

### First we must separate the categorical variables from the numeric ones, since they are handled differently
* Categorical variables: create dummy variables
* Numeric variables: impute missing values

In [None]:
num_attribs = []
cat_attribs = []
# Skip the first column; treat object (string) columns as categorical and all others as numeric
for var, typ in zip(df.columns[1:], df.dtypes[1:]):
    if typ == 'object':
        cat_attribs.append(var)
    else:
        num_attribs.append(var)

In [None]:
num_attribs

In [None]:
cat_attribs

In [None]:
cols = num_attribs + cat_attribs

In [None]:
cols
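The same separation can also be sketched with `select_dtypes`, which makes it explicit that the label is excluded from the feature lists. This is only an illustration on a made-up toy frame: the values are invented, and `toy`, `target`, `num_cols` and `cat_cols` are hypothetical names that do not appear in the notebook. Unlike the loop above, it treats only numeric dtypes as numeric and everything else as categorical.

```python
import pandas as pd

# Hypothetical toy frame: column names mimic the notebook, values are made up
toy = pd.DataFrame({
    "TARGET_B": [0, 1, 0, 1],
    "DONOR_AGE": [67.0, float("nan"), 45.0, 52.0],
    "DONOR_GENDER": ["F", "M", "F", "M"],
    "URBANICITY": ["U", "R", "S", "U"],
})

target = "TARGET_B"

# Drop the label by name so it can never end up among the features
features = toy.drop(columns=[target])

# Numeric dtypes become numeric attributes, all remaining columns categorical
num_cols = features.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in features.columns if c not in num_cols]

print(num_cols)
print(cat_cols)
```

Dropping the label by name rather than by position keeps the split correct even if the column order of the file changes.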

### Now we want to build a pipeline that handles categorical variables and missing values and then runs a logistic regression

### To detect overfitting, the data is split into training data and validation data: roughly 75% training data and 25% validation data


In [None]:
# Random split: each row lands in the training set with probability 0.75
df_subset['is_train'] = np.random.uniform(0, 1, len(df_subset)) <= 0.75

train, validate = df_subset[df_subset['is_train']], df_subset[~df_subset['is_train']]

In [None]:
len(train)

In [None]:
len(validate)

In [None]:
X_train = train[cols].copy()
X_validate = validate[cols].copy()

In [None]:
y_train = train['TARGET_B']
y_validate = validate['TARGET_B']
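The random mask above gives a different split on every run and only approximately a 75/25 ratio. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with a fixed `random_state` and stratification on the label. The toy data and the names `toy`, `X_tr`, `X_val`, `y_tr` and `y_val` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 25 of them donors
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "DONOR_AGE": rng.integers(20, 90, size=100),
    "TARGET_B": np.repeat([0, 1], [75, 25]),
})

X = toy.drop(columns=["TARGET_B"])
y = toy["TARGET_B"]

# test_size=0.25 keeps 75% for training; stratify=y preserves the donor share
# in both parts; random_state makes the split reproducible
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

print(len(X_tr), len(X_val))
print(y_tr.mean(), y_val.mean())
```

Stratifying matters most when donors are rare, because a purely random split can leave the validation set with very few positive cases.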

### Set up a pipeline that makes it possible to handle the workflow dynamically

### Logistic regression object

In [None]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

### One-hot encoding object

- This is the same as "dummy encoding"

In [None]:
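# handle_unknown='ignore' encodes categories not seen during fit as all zeros instead of raising an error
ohe = OneHotEncoder(handle_unknown='ignore')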
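To make the encoding concrete, here is a small sketch on made-up data. The frames `train_toy` and `new_toy` are hypothetical. It also shows what `handle_unknown="ignore"` does when a category appears at prediction time that was not present when the encoder was fitted.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "U" does not occur in the training frame
train_toy = pd.DataFrame({"DONOR_GENDER": ["F", "M", "F"]})
new_toy = pd.DataFrame({"DONOR_GENDER": ["M", "U"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_toy)

# One output column per category learned during fit (here: F and M)
encoded = enc.transform(new_toy).toarray()

print(enc.categories_)
print(encoded)
```

The unseen category is encoded as a row of zeros. With the default `handle_unknown="error"`, the same `transform` call would raise an error instead.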

### Impute object for missing values

Missing values:

- Imputation means that you are filling in missing values based on what you know from the non-missing data
- Carefully consider the costs and benefits of imputation before proceeding, because you are making up data

Use SimpleImputer to perform the imputation:

- It requires 2-dimensional input (just like OneHotEncoder)
- By default, it fills missing values with the mean of the non-missing values
- It also supports other imputation strategies: median value, most frequent value, or a user-defined value

In [None]:
imp = SimpleImputer()
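A minimal sketch of the two most common strategies, using a made-up column of ages. The array `ages` and the names `mean_imp` and `median_imp` are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 2-dimensional input: one column, one missing value
ages = np.array([[30.0], [np.nan], [40.0], [80.0]])

mean_imp = SimpleImputer()                     # default strategy="mean"
median_imp = SimpleImputer(strategy="median")

filled_mean = mean_imp.fit_transform(ages)
filled_median = median_imp.fit_transform(ages)

# statistics_ holds the fill value learned for each column
print(mean_imp.statistics_, filled_mean.ravel())
print(median_imp.statistics_, filled_median.ravel())
```

The fill value is learned in `fit` and reused in `transform`, so inside a pipeline the validation data is filled with the training mean rather than its own.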

__Now we create a column transformer object that performs one-hot encoding on the categorical variables and imputes missing values in the numeric variables__

In [None]:
ct = make_column_transformer(
    (ohe, cat_attribs),
    (imp, num_attribs),
)

__Define our pipeline. This pipeline performs one-hot encoding and imputes the mean for missing numeric values. As a final step, it runs a logistic regression__

In [None]:
pipe = make_pipeline(ct,logreg)

In [None]:
type(pipe)

__Now we run our pipeline. The fit method estimates the model parameters from the data that comes out of the preprocessing steps of our pipeline__

In [None]:
pipeline = pipe.fit(X_train, y_train)

In [None]:
type(pipeline)
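The sketch below shows how a fitted pipeline of this kind can be inspected. It runs on a made-up six-row frame, so the coefficients mean nothing; the names `toy_X`, `toy_y`, `toy_ct` and `toy_pipe` are hypothetical. `make_pipeline` names each step after its lowercased class name, which is how the logistic regression step is retrieved.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one categorical and one numeric column
toy_X = pd.DataFrame({
    "DONOR_GENDER": ["F", "M", "F", "M", "F", "M"],
    "DONOR_AGE": [30.0, np.nan, 45.0, 70.0, np.nan, 65.0],
})
toy_y = pd.Series([0, 0, 0, 1, 1, 1])

toy_ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["DONOR_GENDER"]),
    (SimpleImputer(), ["DONOR_AGE"]),
)
toy_pipe = make_pipeline(
    toy_ct, LogisticRegression(solver="liblinear", random_state=1)
)
toy_pipe.fit(toy_X, toy_y)

# Step names are generated from the class names
print(list(toy_pipe.named_steps))

# One coefficient per transformed column: two gender dummies plus age
coefs = toy_pipe.named_steps["logisticregression"].coef_
print(coefs.shape)

# predict_proba returns one row per observation and one column per class
proba = toy_pipe.predict_proba(toy_X)
print(proba.shape)
```

Because preprocessing and model live in one object, calling `predict_proba` on new data applies the encoding and imputation learned from the training data automatically.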

# Train the model on the training data and then evaluate on the validation data

- predict: A two-dimensional array with the posterior probabilities for each observation: the first column for non-donation, the second for donation
- fpr: False positive rate, the fraction of non-givers incorrectly classified as givers at a specific threshold value
- tpr: True positive rate, the fraction of givers correctly classified as givers at a specific threshold value
- threshold: Threshold values (sorted in descending order) for the predicted probability of donating
- roc_auc: Area under the ROC (receiver operating characteristic) curve. A value close to 1 indicates a strong model. A value close to 0.5 means that the model is no better than random guessing

In [None]:
# Predict on the validation data with the fitted model

predict = pipeline.predict_proba(X_validate)

# Column 1 holds the predicted probability of a donation (TARGET_B = 1)
fpr, tpr, threshold = roc_curve(y_validate, predict[:, 1])

roc_auc = auc(fpr, tpr)

In [None]:
type(predict)

In [None]:
len(predict)

In [None]:
predict
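The quantities returned by `roc_curve` can be checked by hand on a tiny example. The labels and scores below are made up and unrelated to the veteran data; the names `y_true`, `scores`, `fpr_toy`, `tpr_toy`, `thr_toy` and `auc_toy` are hypothetical. The sketch also uses `confusion_matrix` to show how a single threshold, here 0.5 as an assumption, corresponds to one point on the curve.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Hypothetical labels and predicted probabilities for four observations
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, thr_toy = roc_curve(y_true, scores)
auc_toy = auc(fpr_toy, tpr_toy)

print(fpr_toy)
print(tpr_toy)
print(auc_toy)

# Classify as donor when the score is at least 0.5
y_hat = (scores >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# The same rates computed directly from the confusion matrix
tpr_at_half = tp / (tp + fn)
fpr_at_half = fp / (fp + tn)
print(tpr_at_half, fpr_at_half)
```

At the 0.5 threshold, one of the two donors is found and no non-donor is misclassified, which is the point (0, 0.5) on the toy ROC curve.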

# A ROC curve for the charity classifier on the charity data

__It traces out two types of error as we vary the threshold value for the posterior probability of donation. The actual thresholds are not shown. The true positive rate is the sensitivity: the fraction of givers that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-givers that we classify incorrectly as givers, using that same threshold value. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dashed line represents the “no information” classifier.__


In [None]:
# Plot the ROC curve and report the AUC in the legend

fig = plt.figure(figsize=(10, 8))

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.3f' % roc_auc)
plt.legend(loc='lower right')
# Dashed diagonal: the "no information" classifier
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.show()