# Exploration of Prudential Life Insurance Data

### Data retrieved from: 
https://www.kaggle.com/c/prudential-life-insurance-assessment


###### File descriptions:

* train.csv - the training set, contains the Response values
* test.csv - the test set, you must predict the Response variable for all rows in this file
* sample_submission.csv - a sample submission file in the correct format

###### Data fields:

Variable |	Description
-------- |  ------------
Id       |	A unique identifier associated with an application.
Product_Info_1-7 |	A set of normalized variables relating to the product applied for
Ins_Age |	Normalized age of applicant
Ht |	Normalized height of applicant
Wt |	Normalized weight of applicant
BMI |	Normalized BMI of applicant
Employment_Info_1-6 |	A set of normalized variables relating to the employment history of the applicant.
InsuredInfo_1-6 |	A set of normalized variables providing information about the applicant.
Insurance_History_1-9 |	A set of normalized variables relating to the insurance history of the applicant.
Family_Hist_1-5 |	A set of normalized variables relating to the family history of the applicant.
Medical_History_1-41 |	A set of normalized variables relating to the medical history of the applicant.
Medical_Keyword_1-48 |	A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.
Response |	This is the target variable, an ordinal variable relating to the final decision associated with an application


The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.

My thoughts are as follows:

The main dependent variable is the Risk Response (1-8)
What are variables are correlated to the risk response?
How do I perform correlation analysis between variables?

#Import libraries

In [2]:
# Importing libraries

%pylab inline
%matplotlib inline
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import preprocessing
import numpy as np

Populating the interactive namespace from numpy and matplotlib


In [3]:
# Convert variable data into categorical, continuous, discrete, 
# and dummy variable lists the following into a dictionary

#Seperation of columns into categorical, continous and discrete

In [4]:
s = ["Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41",
    "Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5",
     "Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32"]
 

varTypes = dict()


varTypes['categorical'] = s[0].split(', ')


varTypes['continuous'] = s[1].split(', ')


varTypes['discrete'] = s[2].split(', ')



varTypes['dummy'] = ["Medical_Keyword_"+str(i) for i in range(1,49)]


In [5]:
#Prints out each of the the variable types as a check
#for i in iter(varTypes['dummy']):
    #print i

#Importing life insurance data set

In [6]:
#Import training data 
d_raw = pd.read_csv('prud_files/train.csv')
d = d_raw.copy()

In [9]:
len(d.columns)

128

# Pre-processing raw dataset for NaN values

In [181]:
# Get all the columns that have NaNs
d = d_raw.copy()
a = pd.isnull(d).sum()
nullColumns = a[a>0].index.values

#for c in nullColumns:
    #d[c].fillna(-1)

#Determine the min and max values for the NaN columns
a = pd.DataFrame(d, columns=nullColumns).describe()
a_min = a[3:4]
a_max = a[7:8]





array(['Employment_Info_1', 'Employment_Info_4', 'Employment_Info_6',
       'Insurance_History_5', 'Family_Hist_2', 'Family_Hist_3',
       'Family_Hist_4', 'Family_Hist_5', 'Medical_History_1',
       'Medical_History_10', 'Medical_History_15', 'Medical_History_24',
       'Medical_History_32'], dtype=object)

In [175]:
nullList = ['Family_Hist_4',
 'Medical_History_1',
 'Medical_History_10',
 'Medical_History_15',
 'Medical_History_24',
 'Medical_History_32']

pd.DataFrame(a_max, columns=nullList)

Unnamed: 0,Family_Hist_4,Medical_History_1,Medical_History_10,Medical_History_15,Medical_History_24,Medical_History_32
max,0.943662,240,240,240,240,240


In [303]:
# Convert all NaNs to -1 and sum up all medical keywords across columns

df = d.fillna(-1)
b = pd.DataFrame(df[varTypes["dummy"]].sum(axis=1), columns=["Medical_Keyword_Sum"])
df= pd.concat([df,b], axis=1, join='outer')

Unnamed: 0,Id,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,...,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response,Medical_Keyword_Sum
0,2,1,D3,10,0.076923,2,1,1,0.641791,0.581818,...,0,0,0,0,0,0,0,0,8,0
1,5,1,A1,26,0.076923,2,3,1,0.059701,0.600000,...,0,0,0,0,0,0,0,0,4,0
2,6,1,E1,26,0.076923,2,3,1,0.029851,0.745455,...,0,0,0,0,0,0,0,0,8,0
3,7,1,D4,10,0.487179,2,3,1,0.164179,0.672727,...,0,0,0,0,0,0,0,0,8,1
4,8,1,D2,26,0.230769,2,3,1,0.417910,0.654545,...,0,0,0,0,0,0,0,0,8,0
5,10,1,D2,26,0.230769,3,1,1,0.507463,0.836364,...,0,0,0,0,0,0,0,0,8,2
6,11,1,A8,10,0.166194,2,3,1,0.373134,0.581818,...,0,0,0,0,0,0,0,0,8,0
7,14,1,D2,26,0.076923,2,3,1,0.611940,0.781818,...,0,0,0,0,0,0,0,0,1,0
8,15,1,D3,26,0.230769,2,3,1,0.522388,0.618182,...,0,0,0,0,0,0,0,0,8,1
9,16,1,E1,21,0.076923,2,3,1,0.552239,0.600000,...,0,0,0,0,0,0,0,0,1,2


# Create or import the test data set

In [328]:
#Turn split train to test on or off.  

#If on, 10% of the dataset is used for feature training
#If off, training set is loaded from file

splitTrainToTest = 1

if(splitTrainToTest):
    
    d_gb = df.groupby("Response")
    
    df_test = pd.DataFrame()
    
    for name, group in d_gb:
        df_test = pd.concat([df_test, group[:len(group)/10]], axis=0, join='outer')
    print "test data is 10% training data"
    
else:
    d_test = pd.read_csv('prud_files/test.csv')
    df_test = d_test.fillna(-1)
    b = pd.DataFrame(df[varTypes["dummy"]].sum(axis=1), columns=["Medical_Keyword_Sum"])
    df_test= pd.concat([df_test,b], axis=1, join='outer')
    print "test data is prud_files/test.csv"




test data is 10% training data


# Data transformation and extraction 

##Data groupings

In [275]:
df_cat = df[["Id","Response"]+varTypes["categorical"]]
df_disc = df[["Id","Response"]+varTypes["discrete"]]
df_cont = df[["Id","Response"]+varTypes["continuous"]]
df_dummy = df[["Id","Response"]+varTypes["dummy"]]

df_cat_test = df_test[["Id","Response"]+varTypes["categorical"]]
df_disc_test = df_test[["Id","Response"]+varTypes["discrete"]]
df_cont_test = df_test[["Id","Response"]+varTypes["continuous"]]
df_dummy_test = df_test[["Id","Response"]+varTypes["dummy"]]

In [355]:
## Extract categories of each column

df_n = df[["Response", "Medical_Keyword_Sum"]+varTypes["categorical"]+varTypes["discrete"]+varTypes["continuous"]].copy()
df_test_n = df_test[["Response","Medical_Keyword_Sum"]+varTypes["categorical"]+varTypes["discrete"]+varTypes["continuous"]].copy()

In [356]:
#Get all the Product Info 2 categories
a = pd.get_dummies(df["Product_Info_2"]).columns.tolist()

norm_PI2_dict = dict()

#Create an enumerated dictionary of Product Info 2 categories
i=1
for c in a:
    norm_PI2_dict.update({c:i})
    i+=1 

print norm_PI2_dict

df_n = df_n.replace(to_replace={'Product_Info_2':norm_PI2_dict})
df_test_n = df_test_n.replace(to_replace={'Product_Info_2':norm_PI2_dict})

df_n

{'B2': 10, 'D1': 15, 'E1': 19, 'D3': 17, 'A1': 1, 'D4': 18, 'A3': 3, 'A2': 2, 'A5': 5, 'A4': 4, 'A7': 7, 'A6': 6, 'C3': 13, 'A8': 8, 'C1': 11, 'C4': 14, 'D2': 16, 'C2': 12, 'B1': 9}


Unnamed: 0,Response,Medical_Keyword_Sum,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_5,Product_Info_6,Product_Info_7,Employment_Info_2,Employment_Info_3,...,Wt,BMI,Employment_Info_1,Employment_Info_4,Employment_Info_6,Insurance_History_5,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5
0,8,0,1,17,10,2,1,1,12,1,...,0.148536,0.323008,0.0280,0.00000,-1.0000,0.000667,-1.000000,0.598039,-1.000000,0.526786
1,4,0,1,1,26,2,3,1,1,3,...,0.131799,0.272288,0.0000,0.00000,0.0018,0.000133,0.188406,-1.000000,0.084507,-1.000000
2,8,0,1,19,26,2,3,1,9,1,...,0.288703,0.428780,0.0300,0.00000,0.0300,-1.000000,0.304348,-1.000000,0.225352,-1.000000
3,8,1,1,18,10,2,3,1,9,1,...,0.205021,0.352438,0.0420,0.00000,0.2000,-1.000000,0.420290,-1.000000,0.352113,-1.000000
4,8,0,1,16,26,2,3,1,9,1,...,0.234310,0.424046,0.0270,0.00000,0.0500,-1.000000,0.463768,-1.000000,0.408451,-1.000000
5,8,2,1,16,26,3,1,1,15,1,...,0.299163,0.364887,0.3250,0.00000,1.0000,0.005000,-1.000000,0.294118,0.507042,-1.000000
6,8,0,1,8,10,2,3,1,1,3,...,0.173640,0.376587,0.1100,-1.00000,0.8000,0.001667,0.594203,-1.000000,0.549296,-1.000000
7,1,0,1,16,26,2,3,1,12,1,...,0.403766,0.571612,0.1200,0.00000,1.0000,0.000667,-1.000000,0.490196,-1.000000,0.633929
8,8,1,1,17,26,2,3,1,9,1,...,0.184100,0.362643,0.1650,0.00000,1.0000,0.007613,-1.000000,0.529412,0.676056,-1.000000
9,1,2,1,19,21,2,3,1,1,3,...,0.284519,0.587796,0.0250,0.00000,0.0500,0.000667,0.797101,-1.000000,-1.000000,0.553571


#Categorical normalization

In [359]:
# normalizes a single dataframe column and returns the result
def normalize_df(d):
    min_max_scaler = preprocessing.MinMaxScaler()
    x = d.values.astype(np.float)
    #return pd.DataFrame(min_max_scaler.fit_transform(x))
    return pd.DataFrame(min_max_scaler.fit_transform(x))


def normalize_cat(d):
    
    for x in varTypes["discrete"]:
            try:
                a = pd.DataFrame(normalize_df(d_disc[x]))
                a.columns=[str("n"+x)]
                d_disc = pd.concat([d_disc, a], axis=1, join='outer')
            except Exception as e:
                print e.args
                print "Error on "+str(x)+" w error: "+str(e)

    return d_disc



def normalize_disc(d_disc):
    
    for x in varTypes["discrete"]:
            try:
                a = pd.DataFrame(normalize_df(d_disc[x]))
                a.columns=[str("n"+x)]
                d_disc = pd.concat([d_disc, a], axis=1, join='outer')
            except Exception as e:
                print e.args
                print "Error on "+str(x)+" w error: "+str(e)

    return d_disc


# t= categorical, discrete, continuous

def normalize_cols(d, t = "categorical"):
    
    for x in varTypes[t]:
            try:
                a = pd.DataFrame(normalize_df(d[x]))
                a.columns=[str("n"+x)]
                a = pd.concat(a, axis=1, join='outer')
            except Exception as e:
                print e.args
                print "Error on "+str(x)+" w error: "+str(e)

    return a

def normalize_response(d):
    
    a = pd.DataFrame(normalize_df(d["Response"]))
    a.columns=["nResponse"]
    #d_cat = pd.concat([d_cat, a], axis=1, join='outer')
    
    return a

In [15]:
df_n_2 = df_n.copy()
df_n_test_2 = df_test_n.copy()

df_n_2 = df_n_2[["Response"]+varTypes["categorical"]+varTypes["discrete"]]


df_n_test_2 = df_n_test_2[["Response"]+varTypes["categorical"]+varTypes["discrete"]]

df_n_2 = df_n_2.apply(normalize_df, axis=1)
df_n_test_2 = df_n_test_2.apply(normalize_df, axis=1)


TypeError: 'numpy.int64' object is not iterable

In [382]:
df_n_3 = pd.concat([df["Id"],df_n["Medical_Keyword_Sum"],df_n_2, df_n[varTypes["continuous"]]],axis=1,join='outer')
df_n_test_3 = pd.concat([df_test["Id"],df_test_n["Medical_Keyword_Sum"],df_n_test_2, df_test_n[varTypes["continuous"]]],axis=1,join='outer')

In [None]:
train_data = df_n_3.values
test_data = df_n_test_3.values

In [None]:
from sklearn import linear_model

clf = linear_model.Lasso(alpha = 0.1)
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)

print accuracy_score(pred, Y_test)

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 1)
#model = model.fit(train_data[0:,2:],train_data[0:,0])

In [409]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


clf = GaussianNB()

clf.fit(train_data[0:,2:],train_data[0:,0])

pred = clf.predict(X_test)

print accuracy_score(pred, Y_test)

ValueError: need more than 1 value to unpack

In [410]:

from sklearn.metrics import accuracy_score

SyntaxError: invalid syntax (<ipython-input-410-b1fabbd013ba>, line 1)

ValueError: shapes (5934,) and (59381,) not aligned: 5934 (dim 0) != 59381 (dim 0)

In [340]:
df_n.columns.tolist()

['nMedical_History_41']

In [32]:
d_cat = df_cat.copy()
d_cat_test = df_cat_test.copy()

d_cont = df_cont.copy()
d_cont_test = df_cont_test.copy()

d_disc = df_disc.copy()
d_disc_test = df_disc_test.copy()

In [None]:
#df_cont_n = normalize_cols(d_cont, "continuous")
#df_cont_test_n = normalize_cols(d_cont_test, "continuous")

In [31]:
df_cat_n = normalize_cols(d_cat, "categorical")
df_cat_test_n = normalize_cols(d_cat_test, "categorical")

('could not convert string to float: A8',)
Error on Product_Info_2 w error: could not convert string to float: A8
('could not convert string to float: A8',)
Error on Product_Info_2 w error: could not convert string to float: A8


In [33]:
df_disc_n = normalize_cols(d_disc, "discrete")
df_disc_test_n = normalize_cols(d_disc, "discrete")

("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_1 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_10 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_15 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_24 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_32 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64'

In [21]:
a = df_cat_n.iloc[:,62:]

# TODO: Clump into function
#rows are normalized into binary columns of groupings 

<bound method Index.tolist of Index([u'nResponse', u'nProduct_Info_1', u'nProduct_Info_3', u'nProduct_Info_5', u'nProduct_Info_6', u'nProduct_Info_7', u'nEmployment_Info_2', u'nEmployment_Info_3', u'nEmployment_Info_5', u'nInsuredInfo_1', u'nInsuredInfo_2', u'nInsuredInfo_3', u'nInsuredInfo_4', u'nInsuredInfo_5', u'nInsuredInfo_6', u'nInsuredInfo_7', u'nInsurance_History_1', u'nInsurance_History_2', u'nInsurance_History_3', u'nInsurance_History_4', u'nInsurance_History_7', u'nInsurance_History_8', u'nInsurance_History_9', u'nFamily_Hist_1', u'nMedical_History_2', u'nMedical_History_3', u'nMedical_History_4', u'nMedical_History_5', u'nMedical_History_6', u'nMedical_History_7', u'nMedical_History_8', u'nMedical_History_9', u'nMedical_History_11', u'nMedical_History_12', u'nMedical_History_13', u'nMedical_History_14', u'nMedical_History_16', u'nMedical_History_17', u'nMedical_History_18', u'nMedical_History_19', u'nMedical_History_20', u'nMedical_History_21', u'nMedical_History_22', u'nMe

In [14]:
# Define various group by data streams

df = d
    
gb_PI2 = df.groupby('Product_Info_1')
gb_PI2 = df.groupby('Product_Info_2')

gb_Ins_Age = df.groupby('Ins_Age')
gb_Ht = df.groupby('Ht')
gb_Wt = df.groupby('Wt')

gb_response = df.groupby('Response')

In [None]:
#Outputs rows the differnet categorical groups

for c in df.columns:
    if (c in varTypes['categorical']):
        if(c != 'Id'):
            a = [ str(x)+", " for x in df.groupby(c).groups ]
            print c + " : " + str(a)
            
    

In [None]:
df_prod_info = pd.DataFrame(d, columns=["Response"]+ [ "Product_Info_"+str(x) for x in range(1,8)]) 

df_emp_info = pd.DataFrame(d, columns=["Response"]+ [ "Employment_Info_"+str(x) for x in range(1,6)]) 

# continous
df_bio = pd.DataFrame(d, columns=["Response", "Ins_Age", "Ht", "Wt","BMI"])

# all the values are discrete (0 or 1)
df_med_kw = pd.DataFrame(d, columns=["Response"]+ [ "Medical_Keyword_"+str(x) for x in range(1,48)])

# Grouping of various categorical data sets

### Histograms and descriptive statistics for Risk Response, Ins_Age, BMI, Wt

In [None]:
plt.figure(0)
plt.subplot(121)
plt.title("Categorical - Histogram for Risk Response")
plt.xlabel("Risk Response (1-7)")
plt.ylabel("Frequency")
plt.hist(df.Response)
plt.savefig('images/hist_Response.png')
print df.Response.describe()
print ""

plt.subplot(122)
plt.title("Normalized - Histogram for Risk Response")
plt.xlabel("Normalized Risk Response (1-7)")
plt.ylabel("Frequency")
plt.hist(df_cat_n.nResponse)
plt.savefig('images/hist_norm_Response.png')
print df_cat_n.nResponse.describe()
print ""

In [None]:

def plotContinuous(d, t):
    plt.title("Continuous - Histogram for "+ str(t))
    plt.xlabel("Normalized "+str(t)+"[0,1]")
    plt.ylabel("Frequency")
    plt.hist(d)
    plt.savefig("images/hist_"+str(t)+".png")
    #print df.iloc[:,:1].describe()
    print ""
    

for i in range(i,len(df_cat.columns:
    
    plt.figure(1)
    plotContinuous(df.Ins_Age, "Ins_Age")
plt.show()

In [26]:
df_disc.describe()[7:8]

Unnamed: 0,Id,Response,Medical_History_1,Medical_History_10,Medical_History_15,Medical_History_24,Medical_History_32
max,79146,8,240,240,240,240,240


In [None]:
plt.figure(1)
plt.title("Continuous - Histogram for Ins_Age")
plt.xlabel("Normalized Ins_Age [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Ins_Age)
plt.savefig('images/hist_Ins_Age.png')
print df.Ins_Age.describe()
print ""

plt.figure(2)
plt.title("Continuous - Histogram for BMI")
plt.xlabel("Normalized BMI [0,1]")
plt.ylabel("Frequency")
plt.hist(df.BMI)
plt.savefig('images/hist_BMI.png')
print df.BMI.describe()
print ""

plt.figure(3)
plt.title("Continuous - Histogram for Wt")
plt.xlabel("Normalized Wt [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Wt)
plt.savefig('images/hist_Wt.png')
print df.Wt.describe()
print ""

plt.show()

In [None]:


plt.show()

### Histograms and descriptive statistics for Product_Info_1-7

In [None]:
k=1
for i in range(1,8):
    '''
    print "The iteration is: "+str(i)
    print df['Product_Info_'+str(i)].describe()
    print ""
    '''
    
    plt.figure(i)
        
    if(i == 4):
       
        plt.title("Continuous - Histogram for Product_Info_"+str(i))
        plt.xlabel("Normalized value: [0,1]")
        plt.ylabel("Frequency")
        plt.hist(df['Product_Info_'+str(i)])
        plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
            
        
    else:
        
        if(i != 2):
            
            plt.subplot(1,2,1)
            plt.title("Cat-Hist- Product_Info_"+str(i))
            plt.xlabel("Categories")
            plt.ylabel("Frequency")
            plt.hist(df['Product_Info_'+str(i)])
            plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
            
            
            plt.subplot(1,2,2)
            plt.title("Normalized - Histogram of Product_Info_"+str(i))
            plt.xlabel("Categories")
            plt.ylabel("Frequency")
            plt.hist(df_cat_n['nProduct_Info_'+str(i)])
            plt.savefig('images/hist_norm_Product_Info_'+str(i)+'.png')
            
        elif(i == 2):
            plt.title("Cat-Hist Product_Info_"+str(i))
            plt.xlabel("Categories")
            plt.ylabel("Frequency")
            df.Product_Info_2.value_counts().plot(kind='bar') 
            plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
         
plt.show()



#Split dataframes into categorical, continuous, discrete, dummy, and response

In [None]:
catD = df.loc[:,varTypes['categorical']]
contD = df.loc[:,varTypes['continuous']]
disD = df.loc[:,varTypes['discrete']]
dummyD = df.loc[:,varTypes['dummy']]
respD = df.loc[:,['id','Response']]

#Descriptive statistics and scatter plot relating Product_Info_2 and Response

In [None]:
prod_info = [ "Product_Info_"+str(i) for i in range(1,8)]

a = catD.loc[:, prod_info[1]]

stats = catD.groupby(prod_info[1]).describe()

In [None]:
c = gb_PI2.Response.count()
plt.figure(0)

plt.scatter(c[0],c[1])

In [None]:
plt.figure(0)
plt.title("Histogram of "+"Product_Info_"+str(i))
plt.xlabel("Categories " + str((a.describe())['count']))
plt.ylabel("Frequency")

In [None]:


for i in range(1,8):
    a = catD.loc[:, "Product_Info_"+str(i)]
    if(i is not 4):
        print a.describe()
    print ""
    
    plt.figure(i)
    plt.title("Histogram of "+"Product_Info_"+str(i))
    plt.xlabel("Categories " + str((catD.groupby(key).describe())['count']))
    plt.ylabel("Frequency")
    
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    
    if a.dtype in (np.int64, np.float, float, int):
        a.hist()
        
# Random functions
#catD.Product_Info_1.describe()
#catD.loc[:, prod_info].groupby('Product_Info_2').describe()
#df[varTypes['categorical']].hist()

In [None]:
catD.head(5)

In [None]:
#Exploration of the discrete data
disD.describe()

In [None]:
disD.head(5)

In [None]:
#Iterate through each categorical column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    plt.title("Histogram of "+str(key))
    plt.xlabel("Categories " + str((df.groupby(key).describe())['count']))
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        df[key].hist()
    
    i+=1

In [None]:

#Iterate through each 'discrete' column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['discrete']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    fig, axes = plt.subplots(nrows = 1, ncols = 2)
    
    #Histogram based on normalized value counts of the data set
    disD[key].value_counts().hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    
    #Cumulative histogram based on normalized value counts of the data set
    disD[key].value_counts().hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1

In [None]:
#2D Histogram

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    
    x = catD[key].value_counts(normalize=True)
    y = df['Response']
    
    plt.hist2d(x[1], y, bins=40, norm=LogNorm())
    plt.colorbar()
    
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1

In [None]:
#Iterate through each categorical column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        #(1.*df[key].value_counts()/len(df[key])).hist()
        df[key].value_counts(normalize=True).plot(kind='bar')
    
    i+=1

In [1]:
df.loc('Product_Info_1')

NameError: name 'df' is not defined

hist_BMI.png
hist_Ins_Age.png
hist_norm_Product_Info_1.png
hist_norm_Product_Info_3.png
hist_norm_Product_Info_5.png
hist_norm_Product_Info_6.png
hist_norm_Product_Info_7.png
hist_norm_Response.png
hist_product_info_1.png
hist_Product_Info_2.png
hist_Product_Info_3.png
hist_product_info_4.png
hist_Product_Info_5.png
hist_Product_Info_6.png
hist_Product_Info_7.png
hist_response.png
hist_Wt.png
RFC_scatter_alpha_kappa_test1.png
RFC_scatter_alpha_kappa_test2.png
scatterLasso_alpha_kappa_test1.png
scatterLasso_alpha_kappa_test2.png
scatter_alpha_kappa.png
scatter_alpha_kappa_test1.png
scatter_alpha_kappa_test2.png
subplot_demo.py
