## Imputing missing values with predicted values on the Titanic dataset
http://www.ultravioletanalytics.com/2014/11/03/kaggle-titanic-competition-part-ii-missing-values/
Some predictive models are inherently able to deal with missing data (neural networks come to mind), while others require that missing values be dealt with separately. The RandomForestClassifier model in scikit-learn, as used here, is not able to handle missing values, so we’ll need to assign values before training the model. The following is a partial list of ways missing values can be dealt with:
Assign a value that indicates a missing value – This is particularly appropriate for categorical variables (more on this in the next post). The fact that a value is missing can be useful information in and of itself. Perhaps when a value is missing for a particular variable, there is some underlying cause that makes it correlate more highly with another value.
> More techniques will be presented in the PreprocessingData.ipynb notebook.
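As a minimal sketch of the "assign a value that indicates a missing value" approach: the rows below are made up, and the `Cabin_missing` column name is a hypothetical choice, not something from the original post.

```python
import numpy as np
import pandas as pd

# Toy rows (not real passengers) with gaps in a categorical column
toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46', np.nan]})

# Record the fact that the value was missing before overwriting it
toy['Cabin_missing'] = toy['Cabin'].isnull().astype(int)

# Replace the gaps with a sentinel category of their own
toy['Cabin'] = toy['Cabin'].fillna('U0')

print(toy)
```

Keeping the indicator column separate means the "was missing" signal survives even if the sentinel category is later merged or re-encoded.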

We combine the data from the two files into one for a simple reason: when we perform feature engineering on the features, it’s often useful to know the full range of possible values, as well as the distributions of all known values. This will require that we keep track of the training and test data during our processing, but it turns out to not be too difficult.

In [31]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# read in the training and testing data into Pandas.DataFrame objects
input_df = pd.read_csv('C:/Dataset/titanic/train.csv', header=0)
submit_df  = pd.read_csv('C:/Dataset/titanic/test.csv',  header=0)

# merge the two DataFrames into one; ignore_index=True renumbers the rows
# so there aren't duplicate index labels
df = pd.concat([input_df, submit_df], ignore_index=True)

# restore the original column order (older pandas versions sort the columns on concat)
df = df.reindex(columns=input_df.columns)

print(df.shape[1], "columns:", df.columns.values)
print("Row count:", df.shape[0])

12 columns: ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']
Row count: 1309
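Keeping track of the training and test rows after combining them can be sketched as follows. This assumes, as in the Titanic files, that only the training rows carry a `Survived` label; the toy frames and variable names are illustrative.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical rows)
train = pd.DataFrame({'PassengerId': [1, 2, 3],
                      'Survived': [0, 1, 1],
                      'Fare': [7.25, 71.28, 7.92]})
test = pd.DataFrame({'PassengerId': [4, 5],
                     'Fare': [8.05, 53.10]})

combined = pd.concat([train, test], ignore_index=True)

# ... feature engineering on `combined` would go here ...

# Test rows are exactly the ones with no Survived label
train_part = combined[combined['Survived'].notnull()]
test_part = combined[combined['Survived'].isnull()].drop('Survived', axis=1)

print(len(train_part), len(test_part))
```

Splitting on the missing label avoids having to remember row counts, though slicing by the original length of the training frame would work as well.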


## Assign a specified value

In [32]:
# Replace missing values with "U0"
df['Cabin'] = df['Cabin'].fillna('U0')

In [33]:
# Impute missing Embarked with most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().index[0])

In [21]:
df.groupby("Sex")["Fare"].mean()

Sex
female    46.198097
male      26.154601
Name: Fare, dtype: float64

In [10]:
df[['Fare','Sex']].groupby('Sex').mean()

| Sex | Fare |
| --- | --- |
| female | 46.198097 |
| male | 26.154601 |


In [36]:
# Impute missing Fare values with the mean Fare of the passenger's Sex group.
# Incorrect: df['Fare'].fillna(df.groupby("Sex")["Fare"].mean()) -- that Series is indexed by
# 'female'/'male', not by row, so it does not align. transform("mean") returns one value per row.
df['Fare'] = df['Fare'].fillna(df.groupby("Sex")["Fare"].transform("mean"))
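The difference between the two group means can be seen on a toy frame (made-up values): `mean()` yields one value per group, while `transform('mean')` yields one value per row, which is what `fillna` needs in order to align on the index.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Fare': [40.0, np.nan, 10.0, 30.0, np.nan],
})

# One value per group, indexed by the group label ('female', 'male')
group_means = toy.groupby('Sex')['Fare'].mean()

# One value per row, indexed like `toy`, so it lines up with the gaps
row_means = toy.groupby('Sex')['Fare'].transform('mean')

# Row labels 0..4 never match 'female'/'male', so nothing gets filled
not_filled = toy['Fare'].fillna(group_means)

filled = toy['Fare'].fillna(row_means)
print(not_filled.isnull().sum(), filled.tolist())
```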

In [37]:
df.isnull().sum()

PassengerId      0
Survived       418
Pclass           0
Name             0
Sex              0
Age            263
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         0
dtype: int64

## Use a regression or another simple model to predict the values of missing variables
For an important feature, simply filling in the mean is a poor choice; we can predict the missing values with a regression model instead.

In [38]:
def setMissingAges(df):
    from sklearn.ensemble import RandomForestRegressor
    # Select the features to include in the Random Forest regressor;
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    age_df = df[['Age', 'Sex', 'Embarked', 'Fare', 'Parch', 'SibSp', 'Pclass']].copy()
    # Encode male/female as 1/0
    age_df['Sex'] = age_df['Sex'].map({'male': 1, 'female': 0})
    # Encode Embarked as integers using an alternative method
    age_df['Embarked'] = age_df['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2})
    # Split into sets with known and unknown Age values
    knownAge = age_df.loc[age_df['Age'].notnull()]
    unknownAge = age_df.loc[age_df['Age'].isnull()]
    # The Age column is the target; all other columns are the features
    y = knownAge['Age'].values
    X = knownAge.drop('Age', axis=1).values

    # Create and fit a model
    rf = RandomForestRegressor(n_estimators=2000, n_jobs=-1)
    rf.fit(X, y)

    # Use the fitted model to predict the missing values and assign them back
    age_df.loc[age_df['Age'].isnull(), 'Age'] = rf.predict(unknownAge.drop('Age', axis=1).values)

    return age_df[['Age']]

df[['Age']] = setMissingAges(df)


In [39]:
df.isnull().sum()

PassengerId      0
Survived       418
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         0
dtype: int64
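The same known/unknown split works with any regressor. As a sketch of the "another simple model" option, the block below swaps the random forest for an ordinary least-squares fit using NumPy only; the toy values and the `design_matrix` helper are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): Age falls as Pclass rises
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 1, 3],
    'SibSp':  [0, 1, 0, 1, 0, 2, 0, 1],
    'Age':    [50.0, 48.0, 40.0, 41.0, 30.0, 28.0, np.nan, np.nan],
})

features = ['Pclass', 'SibSp']
known = toy[toy['Age'].notnull()]
unknown = toy[toy['Age'].isnull()]

def design_matrix(frame):
    # Prepend a column of ones so the model has an intercept
    return np.column_stack([np.ones(len(frame)),
                            frame[features].to_numpy(dtype=float)])

# Ordinary least squares on the rows where Age is known
coef, *_ = np.linalg.lstsq(design_matrix(known), known['Age'].to_numpy(), rcond=None)

# Predict Age for the rows where it is missing and write the values back
toy.loc[toy['Age'].isnull(), 'Age'] = design_matrix(unknown) @ coef

print(toy)
```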

## Impute missing values with fancyImpute

In [None]:
import pandas as pd
import numpy as np
from fancyimpute import KNN

In [None]:
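# df is the data frame with the missing values; keep only the float columns.
# Note that Survived is a float column too (it is missing for the test rows),
# so drop it first if it should not be imputed.
df_numeric = df.select_dtypes(include=['float'])

# Run fancyimpute's KNN with k=3; it returns a NumPy array, which we wrap in a pandas DataFrame.
# Recent fancyimpute releases use fit_transform(); older ones used complete().
df_filled = pd.DataFrame(KNN(k=3).fit_transform(df_numeric.values))

In [None]:
# Restore the column names and row index of the original frame
df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index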
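If fancyimpute is not available, a similar nearest-neighbour imputation can be sketched with scikit-learn's `KNNImputer`. This is a substitute for, not a copy of, the fancyimpute call above, and the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy = pd.DataFrame({
    'Age':   [22.0, 38.0, np.nan, 35.0],
    'Fare':  [7.25, 71.28, 7.92, np.nan],
    'SibSp': [1.0, 1.0, 0.0, 1.0],
})

# Each gap is filled from the 2 nearest rows that have a value in that column
imputer = KNNImputer(n_neighbors=2)
toy_filled = pd.DataFrame(imputer.fit_transform(toy),
                          columns=toy.columns, index=toy.index)

print(toy_filled)
```

Because distances depend on the scale of each column, scaling the features before imputing is usually worth considering.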

## Feature Engineering using dummy variable

Dummy variables, also known as binary or indicator variables, are most effective when a qualitative (categorical) variable has a small number of distinct values that each occur fairly frequently. In the case of the Embarked variable in the Titanic dataset, there are three distinct values: ‘S’, ‘C’, and ‘Q’.

In [40]:
import pandas as pd

# Create a dataframe of dummy variables for each distinct value of 'Embarked'
#dummies_df = pd.get_dummies(df['Embarked'])

# Rename the columns from 'S', 'C', 'Q' to 'Embarked_S', 'Embarked_C', 'Embarked_Q'
#dummies_df = dummies_df.rename(columns=lambda x: 'Embarked_' + str(x))

# Add the new variables back to the original data set
#df = pd.concat([df, dummies_df], axis=1)

# (or written as a one-liner):
df = pd.concat([df, pd.get_dummies(df['Embarked']).rename(columns=lambda x: 'Embarked_' + str(x))], axis=1)
df.head(3)

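|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Embarked_C | Embarked_Q | Embarked_S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | U0 | S | 0.0 | 0.0 | 1.0 |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1.0 | 0.0 | 0.0 |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | U0 | S | 0.0 | 0.0 | 1.0 |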
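A slightly shorter variant, shown on a toy column, uses the `prefix` argument of `pd.get_dummies` instead of renaming the columns afterwards. `dtype=int` is set explicitly because the default dtype of the dummy columns differs between pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# prefix= names the columns Embarked_C, Embarked_Q, Embarked_S directly;
# dtype=int gives 0/1 values
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```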


## Feature Engineering with Factorizing
Pandas has a method called factorize() that creates a numerical categorical variable from any other variable, assigning a unique ID to each distinct value encountered. This is especially useful for transforming an alphanumeric categorical variable into a numerical categorical variable. In some ways creating a factor variable is similar to dummy variables, in that it allows you to generate a numerical category, but in this case it does this within a single variable.

A categorical variable representing the letter of the Cabin can be created with the following code:

In [271]:
import re

# Replace missing values with "U0" (fillna returns a new Series, so assign it back)
df['Cabin'] = df['Cabin'].fillna('U0')

# create feature for the alphabetical part of the cabin number
df['CabinLetter'] = df['Cabin'].map( lambda x : re.compile("([a-zA-Z]+)").search(x).group())

# convert the distinct cabin letters with incremental integer values
df['CabinLetter'] = pd.factorize(df['CabinLetter'])[0]
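What `factorize()` returns is easiest to see on a small made-up Series: a pair of integer codes and the distinct values they refer to, which is why the code above keeps only element `[0]`.

```python
import pandas as pd

letters = pd.Series(['U', 'C', 'U', 'E', 'C'])

# factorize returns (codes, uniques); codes follow the order of first appearance
codes, uniques = pd.factorize(letters)

print(codes.tolist())
print(list(uniques))
```

The codes carry no meaningful order, so models that treat them as numeric magnitudes may read structure into them that is not there.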

#### Scaling
Scaling addresses an issue with some models: variables with wildly different scales are treated in proportion to the magnitude of their values. For example, Age values will likely max out around 100 while household income values may max out in the millions. Some models are sensitive to the magnitude of the values of the variables, so rescaling the variables to a common scale can help to adjust the influence of each variable. Additionally, scaling can be performed in such a way as to compress all values into a specific range (typically -1 to 1, or 0 to 1). This isn’t necessary for RandomForest models, but is very helpful in other models you may want to try out with this dataset.

In [272]:
# StandardScaler subtracts the mean from each value, then scales to unit variance.
# It expects a 2-D input, so pass a one-column DataFrame and flatten the result.
scaler = StandardScaler()
df['Age_scaled'] = scaler.fit_transform(df[['Age']]).ravel()
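For intuition, both kinds of scaling mentioned above can be written by hand in pandas. The ages below are made up; `ddof=0` is used because StandardScaler divides by the population standard deviation.

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 54.0])

# Standardization: subtract the mean, divide by the population standard deviation
standardized = (ages - ages.mean()) / ages.std(ddof=0)

# Min-max scaling compresses the values into the range 0 to 1
min_max = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized.round(3).tolist())
print(min_max.round(3).tolist())
```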



#### Binning
Binning groups a continuous variable into a small number of intervals; here we use quantiles, so each bin holds roughly the same number of rows. This allows you to create an ordered, categorical variable out of a range of values. For algorithms that make effective use of categorical information this can be useful (probably not so great for linear regression).

In [276]:
df.columns.values

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Embarked_C',
       'Embarked_Q', 'Embarked_S', 'CabinLetter', 'Age_scaled', 'Fare_bin'], dtype=object)

In [277]:
# Divide all fares into quartiles
df['Fare_bin'] = pd.qcut(df['Fare'], 4)

In [284]:
df.shape,df['Fare_bin'].shape

((1309, 18), (1309,))

In [286]:
?pd.factorize

In [279]:
# qcut() creates a new variable that identifies the quartile range, but we can't use the interval
# itself, so either factorize or create dummies from the result.
# https://stackoverflow.com/questions/39390160/pandas-factorize-on-an-entire-data-frame
# factorize() returns a (codes, uniques) tuple; keep only the codes with [0]. Assigning the whole
# tuple raises "ValueError: Length of values does not match length of index".
df['Fare_bin_id'] = pd.factorize(df['Fare_bin'])[0]

In [273]:
# Convert the interval bins to strings first so the dummy columns get plain string names
df = pd.concat([df, pd.get_dummies(df['Fare_bin'].astype(str)).rename(columns=lambda x: 'Fare_' + str(x))], axis=1)
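If only the bin number is needed, `pd.qcut` can return it directly via `labels=False`, which skips the factorize step and keeps the bins in ascending order. A small sketch on made-up fares:

```python
import pandas as pd

fares = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# labels=False returns the integer bin number (0 = lowest quartile)
# instead of an interval label
fare_bin_id = pd.qcut(fares, 4, labels=False)

print(fare_bin_id.tolist())
```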

### New feature
Any variable that is generated from one or more existing variables is called a “derived” variable. We’ve discussed basic transformations that result in useful derived variables, and in this post we’ll look at some more interesting derived variables that aren’t simple transformations.

An important aspect of feature engineering is using insight and creativity to find new features to feed the model. You’ll read this over and over again, and it really can’t be emphasized enough – feature engineering is a hugely important part of the data science pipeline and is where you should spend the most time and effort. The basic transformations and interaction variables that we can automate (more on that later) don’t take too much time, so that leaves us with efforts to creatively find new variables from the raw data.

Very basic examples of a useful derived variable might be pulling the country code and/or area code out of telephone numbers, or extracting country/state/city from GPS coordinates. Any time a qualitative variable represents an object in the world that we know something about, there is an opportunity to derive variables from it. Also, if a data set represents a time series or other historical behavioral information, that can provide a great opportunity for uncovering derived variables.

The Titanic data set is very simple and doesn’t really have a LOT to work with, but there are some text fields which provide us a few opportunities.

Name
The Name variable is useless on its own, but provides us the most to work with. Two obvious opportunities are:

Names – perhaps having more (or fewer) names indicates something about your status that would affect your ability to get on a lifeboat?

In [None]:
# how many different names do they have? 
df['Names'] = df['Name'].map(lambda x: len(re.split(' ', x)))

Title – How you are addressed can definitely indicate status (and gender), which had some influence on getting on a lifeboat.

In [None]:
# What is each person's title? Capture the text between ", " and the first "."
df['Title'] = df['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])

# Group low-occurring, related titles together (.loc avoids chained assignment)
df.loc[df.Title == 'Jonkheer', 'Title'] = 'Master'
df.loc[df.Title.isin(['Ms', 'Mlle']), 'Title'] = 'Miss'
df.loc[df.Title == 'Mme', 'Title'] = 'Mrs'
df.loc[df.Title.isin(['Capt', 'Don', 'Major', 'Col', 'Sir']), 'Title'] = 'Sir'
df.loc[df.Title.isin(['Dona', 'Lady', 'the Countess']), 'Title'] = 'Lady'

# Build binary features
df = pd.concat([df, pd.get_dummies(df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
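The title extraction can also be written with pandas string methods and a single lookup table. The sketch below runs on a few sample names in the dataset's "Surname, Title. Given names" format; the `groups` dictionary is an illustrative subset of the groupings above, not the full mapping.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Reuchlin, Jonkheer. John George',
])

# Capture everything between ", " and the first "."
titles = names.str.extract(r', (.*?)\.', expand=False)

# Collapse rare titles with one lookup table; unmapped titles stay as they are
groups = {'Jonkheer': 'Master', 'the Countess': 'Lady'}
titles = titles.replace(groups)

print(titles.tolist())
```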

FamilyID – A great example of using creativity to tie together several variables: Trevor Stephens created a really interesting derived variable by identifying family members from last name and total family size. It’s in R and I decided not to duplicate it here, but it is definitely worth a look.

Cabin
Not a lot to do here, but a little research into the deck plans (or a little common sense) indicates that the letter in the Cabin variable is the deck, and the number is the room number. The room numbers increased towards the back of the boat, so perhaps that provides some useful measure of location. Additionally, different decks also provide some information on location as well as socioeconomic status, which is again valuable in determining who gets on the lifeboats.

In [18]:
# Replace missing values with "U0"
df['Cabin'] = df['Cabin'].fillna('U0')

# Create a feature for the deck
df['Deck'] = df['Cabin'].map( lambda x : re.compile("([a-zA-Z]+)").search(x).group())
df['Deck'] = pd.factorize(df['Deck'])[0]

# Create binary features for each deck
decks = pd.get_dummies(df['Deck']).rename(columns=lambda x: 'Deck_' + str(x))
df = pd.concat([df, decks], axis=1)

# Create feature for the room number; cabins without any digits get 0 before the +1 offset
def getRoomNumber(cabin):
    match = re.compile("([0-9]+)").search(cabin)
    return int(match.group()) if match else 0

df['Room'] = df['Cabin'].map(getRoomNumber) + 1

Ticket
This variable is clearly ripe for extracting information, but it’s not immediately clear what the values mean. Some quick googling didn’t turn up any information on decoding the values, so we’ll have to make some guesses. After sorting all the values and examining them, a few things give us some clues:

- About a quarter of the tickets have an alphanumeric prefix while the rest consist only of a number.
- There are 45 distinct prefixes initially. If we remove ‘.’ and ‘/’ characters (which appear to be superfluous) and make a few other adjustments, that number drops to 29.
- The number part of the value seems to have some loose correlations: numbers starting with 1 are usually first class tickets, 2 usually second, and 3 third. I say usually because it holds for a majority of examples but not all. There are also ticket numbers starting with 4-9, and those are rare and almost exclusively third class.
- I can’t seem to notice any pattern to whether the ticket number is a 4, 5, or 6-digit number, but that may provide some amount of information as well.
- Several people can share a ticket number. This could be used to create another feature very similar to the FamilyID, except that it would also cover situations like nannies or close friends, who would probably act like a family unit but are not captured by the FamilyID.
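The shared-ticket idea from the last observation can be sketched as follows. The `TicketGroupSize` and `TicketGroupId` column names are hypothetical, and the toy rows only reuse the ticket format of the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Ticket': ['PC 17599', '113803', 'PC 17599', '347082', 'PC 17599'],
})

# Number of passengers travelling on the same ticket
toy['TicketGroupSize'] = toy.groupby('Ticket')['Ticket'].transform('count')

# Integer id per distinct ticket, usable as a FamilyID-style grouping key
toy['TicketGroupId'] = pd.factorize(toy['Ticket'])[0]

print(toy)
```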


In [None]:
def processTicket():
    global df
    
    # extract and massage the ticket prefix
    df['TicketPrefix'] = df['Ticket'].map( lambda x : getTicketPrefix(x.upper()))
    df['TicketPrefix'] = df['TicketPrefix'].map( lambda x: re.sub('[.?/?]', '', x) )
    df['TicketPrefix'] = df['TicketPrefix'].map( lambda x: re.sub('STON', 'SOTON', x) )
        
    # create binary features for each prefix
    prefixes = pd.get_dummies(df['TicketPrefix']).rename(columns=lambda x: 'TicketPrefix_' + str(x))
    df = pd.concat([df, prefixes], axis=1)
    
    # factorize the prefix to create a numerical categorical variable
    df['TicketPrefixId'] = pd.factorize(df['TicketPrefix'])[0]
    
    # extract the ticket number
    df['TicketNumber'] = df['Ticket'].map( lambda x: getTicketNumber(x) )
    
    # create a feature for the number of digits in the ticket number
    # (the np.int alias was removed from NumPy; the built-in int works the same here)
    df['TicketNumberDigits'] = df['TicketNumber'].map( lambda x: len(x) ).astype(int)
    
    # create a feature for the starting number of the ticket number
    df['TicketNumberStart'] = df['TicketNumber'].map( lambda x: x[0:1] ).astype(int)
    
    # The prefix and (probably) number themselves aren't useful
    df.drop(['TicketPrefix', 'TicketNumber'], axis=1, inplace=True)
    

def getTicketPrefix(ticket):
    match = re.compile("([a-zA-Z./]+)").search(ticket)
    if match:
        return match.group()
    else:
        return 'U'

def getTicketNumber(ticket):
    match = re.compile(r"(\d+$)").search(ticket)
    if match:
        return match.group()
    else:
        return '0'
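To check what the prefix and number extraction produce, here is a standalone sketch on a few sample ticket strings. The `ticket_prefix` and `ticket_number` names are illustrative re-implementations of the helpers above, with the clean-up steps folded in.

```python
import re

def ticket_prefix(ticket):
    # First run of letters, dots and slashes; 'U' when the ticket has none
    match = re.search(r"([a-zA-Z./]+)", ticket.upper())
    prefix = match.group() if match else 'U'
    # Strip the separators, then normalise the STON spelling to SOTON
    prefix = re.sub(r"[./]", "", prefix)
    return prefix.replace('STON', 'SOTON')

def ticket_number(ticket):
    # Trailing run of digits; '0' when there is none
    match = re.search(r"(\d+)$", ticket)
    return match.group() if match else '0'

samples = ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', 'LINE']
for t in samples:
    print(t, '->', ticket_prefix(t), ticket_number(t))
```

Because the prefix pattern stops at the first digit, a ticket such as 'A/5 21171' yields the prefix 'A' rather than 'A5'.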

Interaction variables capture effects of the relationship between variables. They are constructed by performing mathematical operations on sets of features. The simple approach that we use in this example is to perform basic operators (add, subtract, multiply, divide) on each pair of numerical features. We could also get much more involved and include more than 2 features in each calculation, and/or use other operators (sqrt, ln, trig functions, etc).

In [None]:
numerics = df.loc[:, ['Age_scaled', 'Fare_scaled', 'Pclass_scaled', 'Parch_scaled', 'SibSp_scaled', 
                      'Names_scaled', 'CabinNumber_scaled', 'Age_bin_id_scaled', 'Fare_bin_id_scaled']]

# for each pair of variables, determine which mathematical operators to use based on redundancy
# note: range(0, size-1) stops before the last column, so the final feature is never paired
for i in range(0, numerics.columns.size-1):
    for j in range(0, numerics.columns.size-1):
        col1 = str(numerics.columns.values[i])
        col2 = str(numerics.columns.values[j])
        # multiply fields together (we allow values to be squared)
        if i <= j:
            name = col1 + "*" + col2
            df = pd.concat([df, pd.Series(numerics.iloc[:,i] * numerics.iloc[:,j], name=name)], axis=1)
        # add fields together
        if i < j:
            name = col1 + "+" + col2
            df = pd.concat([df, pd.Series(numerics.iloc[:,i] + numerics.iloc[:,j], name=name)], axis=1)
        # divide and subtract fields from each other
        if not i == j:
            name = col1 + "/" + col2
            df = pd.concat([df, pd.Series(numerics.iloc[:,i] / numerics.iloc[:,j], name=name)], axis=1)
            name = col1 + "-" + col2
            df = pd.concat([df, pd.Series(numerics.iloc[:,i] - numerics.iloc[:,j], name=name)], axis=1)

This process of automated feature generation can quickly produce a LOT of new variables. In our case, 9 features are listed and the loop above generates 176 new interaction features from them (its ranges stop one short of the last column, so 8 columns are actually paired). In a larger data set with dozens or hundreds of numeric features, this process can generate an overwhelming number of new interactions. Some types of models are really good at handling a very large number of features (I’ve heard of thousands to millions), which would be necessary in such a case.
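The number of generated columns grows quadratically with the number of paired features. A small sketch of the count, following the conditions in the loop (products for i <= j, sums for i < j, ratios and differences for i != j); the function name is made up for illustration:

```python
def count_interactions(n):
    """Count the columns generated when n numeric features are paired."""
    products = n * (n + 1) // 2   # i <= j, squares included
    sums = n * (n - 1) // 2       # i < j
    ratios = n * (n - 1)          # i != j, order matters
    differences = n * (n - 1)     # i != j, order matters
    return products + sums + ratios + differences

print(count_interactions(8))   # 176
print(count_interactions(9))   # 225
```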

It’s very likely that some of the new interaction variables are going to be highly correlated with one of their original variables, or with other interactions, which can be a problem especially for linear models. Highly correlated variables can cause an issue called “multicollinearity”. There is a lot of information out there about how to identify, deal with, and safely ignore multicollinearity in a data set so I’ll avoid an explanation here, but I’ve included some great links at the bottom of this post if you’re interested.

In our solution for the Titanic challenge, I don’t believe that multicollinearity is a problem, specifically because a Random Forest is not a linear model. Removing highly correlated features is a good idea anyway, if for no other reason than to improve performance. We identify highly correlated features using Spearman’s rank correlation coefficient, but you could certainly experiment with other methods such as the Pearson product-moment correlation coefficient.
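One way this filtering could look is sketched below on made-up columns. Spearman's coefficient is computed as the Pearson correlation of the ranks, and the 0.98 threshold is an assumed value, not one taken from the original solution.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'a_squared': [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],  # monotone in 'a'
    'b': [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    'c': [3.0, 6.0, 1.0, 5.0, 2.0, 4.0],
})

# Spearman's coefficient is the Pearson correlation of the ranks
corr = toy.rank().corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.98  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
```

Of each highly correlated pair, this keeps the column that appears first, so column order decides which feature survives.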