In [1]:
%useLatestDescriptors
%use dataframe (0.9.1), lets-plot

In [48]:
var df = DataFrame.readCSV(fileOrUrl = "../src/main/resources/titanic.csv", delimiter = ';', parserOptions = ParserOptions(locale = java.util.Locale.FRENCH))

df.head()

The dataset writes decimal numbers in a non-default format, so we pass the French locale (which expects a comma as the decimal separator) to the parser.
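To illustrate why the locale matters, here is a minimal sketch of locale-aware number parsing in plain Kotlin, using only the JDK. `parseFrenchDecimal` is a hypothetical helper introduced for this illustration and the input strings are made up; it does not describe how the dataframe parser works internally.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper for illustration; not part of the dataframe API.
// Parses a decimal string using French conventions (comma as decimal separator).
fun parseFrenchDecimal(text: String): Double? {
    val format = NumberFormat.getInstance(Locale.FRENCH)
    val position = ParsePosition(0)
    val number = format.parse(text, position)
    // Accept the value only if the whole input was consumed, so "7,25abc" is rejected.
    return if (number != null && position.index == text.length) number.toDouble() else null
}
```

For example, `parseFrenchDecimal("7,25")` returns `7.25`, while a non-numeric string such as `"abc"` yields `null`.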

Before converting the data, we should handle *null* values.

In [11]:
df.describe()

In [12]:
df

# Imputing null values
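Let's make the columns we are going to use non-nullable: null values in the numeric columns are imputed with the column mean, and null values in the categorical columns **sex** and **embarked** are filled with a fixed value.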

In [13]:
val df1 = df
    // imputing
    .fillNulls { sibsp and parch and age and fare }.perCol { mean() }
    .fillNulls { sex }.withValue("female")
    .fillNulls { embarked }.with { "S" }
    .convert { sibsp and parch and age and fare }.toDouble()

df1.head()
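The mean imputation step above can be pictured with a minimal plain-Kotlin sketch. `imputeWithMean` is a hypothetical helper that works on a simple list rather than on a DataFrame column, and it assumes that the mean is computed over the non-null values only.

```kotlin
// Hypothetical helper for illustration; not part of the dataframe API.
// Replaces every null with the mean of the non-null values in the same list.
fun imputeWithMean(values: List<Double?>): List<Double> {
    val present = values.filterNotNull()
    // Without at least one known value there is no mean to impute with.
    require(present.isNotEmpty()) { "Cannot impute a column that contains only nulls" }
    val mean = present.average()
    return values.map { it ?: mean }
}
```

For example, `imputeWithMean(listOf(1.0, null, 3.0))` returns `[1.0, 2.0, 3.0]`.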

In [14]:
df1.schema()

pclass: Int
survived: Int
name: String
sex: String
age: Double
sibsp: Double
parch: Double
ticket: String
fare: Double
cabin: String?
embarked: String
boat: String?
body: Int?
homedest: String?


In [15]:
df1.corr()

In [16]:
val correlations = df1.corr { all() }.with { survived }
    .sortBy { survived }
correlations

Great, at this point we have five numerical features available for analysis: **pclass, age, sibsp, parch, fare**.
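As a reminder of what a correlation coefficient measures, here is a minimal plain-Kotlin sketch of the Pearson coefficient. It assumes that `corr` reports Pearson correlation; `pearson` is a hypothetical helper written for this illustration, not a library function.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper for illustration; not part of the dataframe API.
// Pearson correlation: covariance of x and y divided by the product of their standard deviations.
fun pearson(x: List<Double>, y: List<Double>): Double {
    require(x.size == y.size && x.size > 1) { "Both lists need the same size and at least two values" }
    val meanX = x.average()
    val meanY = y.average()
    var covariance = 0.0
    var varianceX = 0.0
    var varianceY = 0.0
    for (i in x.indices) {
        val dx = x[i] - meanX
        val dy = y[i] - meanY
        covariance += dx * dy
        varianceX += dx * dx
        varianceY += dy * dy
    }
    // The result is NaN when one of the inputs is constant (zero variance).
    return covariance / sqrt(varianceX * varianceY)
}
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate no linear relationship.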

# Analyze by pivoting features
To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. At this stage we can only do so for features that do not have any empty values. It also only makes sense for features that are categorical (Sex), ordinal (Pclass) or discrete (SibSp, Parch).

- **Pclass**: We observe a significant correlation (>0.5) between **Pclass**=1 and **Survived**.

- **Sex**: We confirm the observation made during problem definition that passengers with Sex=female had a very high survival rate of 74%.

- **SibSp** and **Parch**: These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features.

In [17]:
df1.groupBy { pclass }.mean { survived }.sortBy { pclass }

In [18]:
df1.groupBy { sex }.mean { survived }.sortBy { survived }

In [19]:
df1.groupBy { sibsp }.mean { survived }.sortBy { sibsp }

In [20]:
df1.groupBy { parch }.mean { survived }.sortBy { parch }
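Because **survived** is encoded as 0 or 1, the mean of **survived** within a group is simply the share of survivors in that group. A minimal plain-Kotlin sketch of the same idea follows; the `Passenger` class and `survivalRateBy` function are hypothetical names used only for this illustration.

```kotlin
// Toy record type for illustration; the notebook itself works on DataFrame rows.
data class Passenger(val sex: String, val survived: Int)

// Hypothetical helper: groups passengers by a key and averages the 0/1 survived flag,
// which yields the survival rate per group.
fun survivalRateBy(passengers: List<Passenger>, key: (Passenger) -> String): Map<String, Double> =
    passengers
        .groupBy(key)
        .mapValues { (_, group) -> group.map { it.survived }.average() }
```

Calling `survivalRateBy(passengers) { it.sex }` on a list of such records gives one rate per distinct value of `sex`.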

# Analyze the importance of the Age feature

It is interesting to compare the **age** distributions of passengers who survived and those who did not.

In [21]:
val byAge = df1.valueCounts { age }.sortBy { age }
byAge

In [22]:
// JetBrains color palette
val colors = mapOf("light_orange" to "#ffb59e", "orange" to "#ff6632", "light_grey" to "#a6a6a6", "dark_grey" to "#4c4c4c")

In [23]:
letsPlot(byAge.toMap()) { x = "age"; y = "count" } + 
    geomPoint(size = 5, color = colors["dark_grey"]) +
    ggsize(850, 500)

In [24]:
val age = df.select { age }.dropNulls().sortBy { age }

letsPlot(age.toMap()) { x = "age" } + geomHistogram(binWidth=5, fill = colors["orange"]) + ggsize(850, 500)

In [25]:
df1.groupBy { age }.pivotCounts { survived }.sortBy { age }

In [26]:
val survivedByAge = df1.select { survived and age }.sortBy { age }
survivedByAge

In [27]:
val plot = letsPlot(survivedByAge.convert { survived }.with { if (it == 1) "Survived" else "Died" }.toMap())

plot +
    geomHistogram(binWidth = 5, alpha = 0.7, position = Pos.dodge) { x = "age"; fill = "survived" } +
    scaleFillManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(850, 500)

In [28]:
// Density plot
plot +
    geomDensity { x="age"; color="survived" } +
    scaleColorManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(850, 250)

In [29]:
// A basic box plot
plot +
    geomBoxplot { x="survived"; y="age"; fill = "survived" } +
    scaleFillManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(500, 400)

It seems that the age distribution is roughly the same for passengers who survived and those who did not.

# Categorical features with One Hot Encoding

To prepare data for ML algorithms, we should replace all String values in categorical features with numbers. There are several ways to preprocess categorical features, and One Hot Encoding is one of them. We will use the [`pivotMatches`](https://kotlin.github.io/dataframe/pivot.html#pivotmatches) operation to convert each categorical column into a set of nested `Boolean` columns, one per unique value.

In [30]:
val pivoted = df1.pivotMatches { pclass and sex and embarked }
pivoted.head()
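The idea behind One Hot Encoding can be sketched in a few lines of plain Kotlin. `oneHot` is a hypothetical helper that returns a flat map from category to a Boolean column; it only illustrates the encoding and does not reproduce the nested column structure that `pivotMatches` builds.

```kotlin
// Hypothetical helper for illustration; not part of the dataframe API.
// For every distinct category, builds a Boolean column that is true
// exactly in the rows where the original value equals that category.
fun oneHot(values: List<String>): Map<String, List<Boolean>> =
    values.distinct().associateWith { category -> values.map { it == category } }
```

For example, `oneHot(listOf("S", "C", "S"))` maps `"S"` to `[true, false, true]` and `"C"` to `[false, true, false]`.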

In [31]:
val df2 = pivoted
    // feature extraction
    .select { survived and pclass and sibsp and parch and age and fare and sex and embarked }
    .convert { allDfs() }.toDouble()

df2.head()

In [32]:
val titanicData = df2.flatten().toMap()

gggrid(
    listOf(
        CorrPlot(titanicData, "Tiles").tiles()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(),
        CorrPlot(titanicData, "Points").points()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(), 
        CorrPlot(titanicData, "Tiles and labels").tiles().labels()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(),
        CorrPlot(titanicData, "Tiles, points and labels").points().labels().tiles()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build()
    ), 1, 700, 600)

# Creation of new features

We suggest combining the **sibsp** and **parch** features into a single new feature, **familyNumber**, defined as the simple sum of **sibsp** and **parch**.

In [33]:
val familyDF = df1.add("familyNumber") { sibsp + parch }

familyDF.head()

In [34]:
familyDF.corr { familyNumber }.with { survived }

In [35]:
familyDF.corr { familyNumber }.with { age }

It looks like the new feature shows no noticeable correlation with the **survived** column, but it has a strong negative correlation with **age**.

# Titles
Let's try to extract something from the names. Many of the strings in the name column contain titles, such as Don, Mr, Mrs and so on.

In [36]:
val titledDF = df.select { survived and name }.add("title") { name.split(".")[0].split(",")[1].trim() }
titledDF.head(100)
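The extraction above relies on names following the pattern `Surname, Title. Given names`. A slightly more defensive plain-Kotlin sketch of the same idea is shown below. `extractTitle` is a hypothetical helper, and the sample names in the usage note are only illustrations of that pattern. Unlike the chained `split` calls, it returns `null` instead of throwing when a name has no comma or no period.

```kotlin
// Hypothetical helper for illustration; not part of the dataframe API.
// Returns the text between the first comma and the first period, or null
// when the name does not follow the "Surname, Title. Given names" pattern.
fun extractTitle(name: String): String? {
    // Everything before the first period; empty when there is no period at all.
    val beforeDot = name.substringBefore('.', missingDelimiterValue = "")
    if (',' !in beforeDot) return null
    return beforeDot.substringAfter(',').trim().ifEmpty { null }
}
```

For example, `extractTitle("Braund, Mr. Owen Harris")` returns `"Mr"`, and a string without the expected delimiters, such as `"no title here"`, returns `null`.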

In [37]:
titledDF.valueCounts { title }

The new **title** column contains some rare titles and some variant spellings (Mlle, Ms, Mme). Let's clean the data and merge the rare titles into one category.

In [38]:
val rareTitles = listOf("Dona", "Lady", "the Countess", "Capt", "Col", "Don", 
                "Dr", "Major", "Rev", "Sir", "Jonkheer")

val cleanedTitledDF = titledDF.update { title }.with {
    when (it) {
        // normalize variant spellings
        "Mlle", "Ms" -> "Miss"
        "Mme" -> "Mrs"
        // merge all rare titles into a single category
        in rareTitles -> "Rare Title"
        else -> it
    }
}

In [39]:
cleanedTitledDF.valueCounts { title }

Now it looks much better: we have only 5 different titles and can see how they correlate with survival.

In [40]:
val correlations = cleanedTitledDF
                    .pivotMatches { title }
                    .corr { title }.with { survived }
correlations

In [41]:
correlations.update { title }.with { it.substringAfter('_') }.filter { title != "survived" }

Women with the titles **Miss** and **Mrs** have about the same chance of surviving, but the same is not true for men: if your title is **Mr**, your odds on the Titanic are bad.

**Rare title** is really rare and doesn't play a big role.

In [42]:
val groupedCleanedTitledDF = cleanedTitledDF.valueCounts { title and survived }.sortBy { title and survived }
groupedCleanedTitledDF

# Surname analysis
It would be very interesting to dig deeper into families and home destinations. We can start this analysis with surnames, which are easily extracted from the **name** feature.

In [43]:
val surnameDF = df1.select { survived and name }.add("surname") { name.split(".")[0].split(",")[0].trim() }
surnameDF.head()

In [44]:
surnameDF.valueCounts { surname }

In [45]:
surnameDF.surname.countDistinct()

875

In [46]:
val firstSymbol by column<String>()

df1
    .add(firstSymbol) { name.split(".")[0].split(",")[0].trim().first().toString() }
    .pivotMatches(firstSymbol)
    .corr { firstSymbol }.with { survived }
