# Kotlin for data science

Data science usually revolves around manipulating tabular data, called *dataframes*, and visualizing it.
In this notebook, we will explore how to use Kotlin for data science, focusing on dataframes and plotting.

## Dataframe

[Official documentation](https://kotlin.github.io/dataframe/)

First, let's load the libraries we will use in this notebook (`%useLatestDescriptors` tells the kernel to use the latest available library descriptors instead of the bundled ones, which in practice pulls in the latest versions of the libraries).

In [31]:
%useLatestDescriptors
%use kandy
%use dataframe

Next, let's import a dataset into a dataframe.
We can find datasets on [Kaggle](https://www.kaggle.com/datasets) or in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/).

In [32]:
val df = DataFrame.read("./04-datasets/ramen-ratings.csv")
df

Review #,Brand,Variety,Style,Country,Stars,Top Ten
2580,New Touch,T's Restaurant Tantanmen,Cup,Japan,3.75,
2579,Just Way,Noodles Spicy Hot Sesame Spicy Hot Se...,Pack,Taiwan,1.0,
2578,Nissin,Cup Noodles Chicken Vegetable,Cup,USA,2.25,
2577,Wei Lih,GGE Ramen Snack Tomato Flavor,Pack,Taiwan,2.75,
2576,Ching's Secret,Singapore Curry,Pack,India,3.75,
2575,Samyang Foods,Kimchi song Song Ramen,Pack,South Korea,4.75,
2574,Acecook,Spice Deli Tantan Men With Cilantro,Cup,Japan,4.0,
2573,Ikeda Shoku,Nabeyaki Kitsune Udon,Tray,Japan,3.75,
2572,Ripe'n'Dry,Hokkaido Soy Sauce Ramen,Pack,Japan,0.25,
2571,KOKA,The Original Spicy Stir-Fried Noodles,Pack,Singapore,2.5,


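The dataframe library automatically generates a typed extension property for each column of the dataframe, so a column can be accessed by its name (in a notebook, these properties become available once the cell that creates the dataframe has run).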

In [33]:
df.Variety

Variety
T's Restaurant Tantanmen
Noodles Spicy Hot Sesame Spicy Hot Se...
Cup Noodles Chicken Vegetable
GGE Ramen Snack Tomato Flavor
Singapore Curry
Kimchi song Song Ramen
Spice Deli Tantan Men With Cilantro
Nabeyaki Kitsune Udon
Hokkaido Soy Sauce Ramen
The Original Spicy Stir-Fried Noodles


We can even use them to write neat code. Here are the first 3 reviews (since the column name `Review #` contains a space and a `#`, we surround it with backticks):

In [34]:
df.sortBy { `Review #` }.take(3)

Review #,Brand,Variety,Style,Country,Stars,Top Ten
1,Westbrae,Miso Ramen,Pack,USA,0.5,
2,Wai Wai,Tom Yum Chili Flavor,Pack,Thailand,2.0,
3,Wai Wai,Tom Yum Shrimp,Pack,Thailand,2.0,


Ramens that have 5 stars (`Stars` is read as a string column here; `toIntOrNull` returns null if the string cannot be converted to an integer, so such rows are filtered out):

In [35]:
df.filter { Stars.toIntOrNull() == 5 }
    .sortByDesc { Stars }

Review #,Brand,Variety,Style,Country,Stars,Top Ten
2570,Tao Kae Noi,Creamy tom Yum Kung Flavour,Pack,Thailand,5,
2569,Yamachan,Yokohama Tonkotsu Shoyu,Pack,USA,5,
2566,Nissin,Demae Ramen Bar Noodle Aka Tonkotsu F...,Pack,Hong Kong,5,
2563,Yamachan,Tokyo Shoyu Ramen,Pack,USA,5,
2559,Jackpot Teriyaki,Beef Ramen,Pack,USA,5,
2558,KOKA,Creamy Soup With Crushed Noodles Hot ...,Cup,Singapore,5,
2552,MyKuali,Penang White Curry Rice Vermicelli Soup,Bowl,Malaysia,5,
2550,Samyang Foods,Paegaejang Ramen,Pack,South Korea,5,
2545,KOKA,Instant Noodles Laksa Singapura Flavour,Pack,Singapore,5,
2543,KOKA,Curry Flavour Instant Noodles,Cup,Singapore,5,
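
One detail of the filter above is worth a quick check: `toIntOrNull` only accepts strings that are plain integers, so a rating written as `5.0` would not match. Whether the dataset actually contains such values is not verified here. The sketch below uses only the Kotlin standard library; `isFiveStars` is a hypothetical helper, not part of the notebook or of the dataframe library.

```kotlin
// Hypothetical helper (not part of the notebook): true when the rating string
// denotes exactly five stars, whether it is written as "5" or "5.0".
fun isFiveStars(stars: String): Boolean = stars.trim().toDoubleOrNull() == 5.0

fun main() {
    // Made-up sample strings in the same style as the Stars column.
    val samples = listOf("5", "5.0", "4.75", "Unrated")
    for (s in samples) {
        // toIntOrNull is null for "5.0", "4.75" and "Unrated";
        // toDoubleOrNull is null only for "Unrated".
        println("$s: int=${s.toIntOrNull()}, double=${s.toDoubleOrNull()}, fiveStars=${isFiveStars(s)}")
    }
}
```

If that stricter behaviour is wanted, the same comparison could presumably be used inside the `filter` lambda instead of `toIntOrNull() == 5`.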


The number of words in each variety:

In [36]:
df.map { Variety.split(" ").size }

[3, 9, 4, 5, 2, 4, 6, 3, 4, 5, 5, 3, 5, 5, 9, 4, 3, 3, 3, 2, 4, 2, 10, 3, 4, 5, 7, 5, 6, 6, 2, 6, 6, 4, 5, 5, 8, 4, 5, 10, 7, 6, 9, 6, 3, 9, 4, 3, 3, 3, 6, 3, 7, 2, 3, 2, 3, 4, 5, 8, 9, 5, 2, 3, 4, 8, 4, 7, 3, 4, 6, 5, 3, 4, 6, 4, 2, 6, 3, 3, 5, 7, 4, 3, 2, 4, 4, 3, 2, 3, 4, 4, 5, 3, 3, 3, 5, 2, 5, 2, 3, 4, 6, 4, 3, 5, 7, 6, 6, 5, 11, 5, 3, 3, 7, 5, 6, 4, 7, 4, 3, 6, 2, 5, 7, 8, 7, 7, 5, 8, 7, 6, 4, 6, 7, 7, 6, 5, 8, 5, 3, 10, 6, 5, 5, 4, 6, 5, 6, 5, 6, 4, 7, 5, 5, 4, 6, 8, 7, 3, 5, 5, 6, 5, 5, 4, 5, 4, 4, 6, 6, 6, 6, 5, 8, 5, 5, 4, 4, 5, 6, 6, 5, 4, 5, 4, 4, 3, 8, 5, 6, 3, 5, 7, 5, 8, 7, 9, 3, 6, 6, 12, 8, 7, 7, 8, 6, 5, 4, 2, 4, 3, 3, 5, 3, 3, 3, 4, 7, 8, 5, 2, 5, 6, 4, 4, 11, 8, 5, 2, 7, 5, 3, 8, 4, 4, 7, 9, 9, 9, 6, 9, 9, 6, 8, 8, 7, 7, 8, 9, 9, 6, 7, 10, 7, 6, 2, 4, 5, 5, 5, 2, 3, 8, 5, 4, 5, 7, 6, 4, 6, 5, 5, 3, 9, 3, 5, 5, 1, 5, 9, 4, 7, 8, 4, 3, 5, 8, 5, 4, 11, 4, 6, 6, 8, 3, 4, 6, 4, 8, 6, 9, 2, 5, 9, 5, 5, 7, 10, 5, 4, 5, 5, 11, 6, 7, 5, 7, 4, 4, 4, 7, 7, 5, 5, 3, 6, 5, 3, 4,
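
Splitting on a single space, as above, counts an extra empty token whenever a name contains two consecutive spaces. A minimal sketch of a more tolerant count, assuming that any run of whitespace should act as one separator; `wordCount` is a hypothetical helper introduced only for illustration.

```kotlin
// Hypothetical helper: counts words, treating any run of whitespace as a single separator.
fun wordCount(text: String): Int =
    text.trim().split(Regex("\\s+")).count { it.isNotEmpty() }

fun main() {
    val variety = "Cup  Noodles Chicken"   // made-up value with two spaces after "Cup"
    println(variety.split(" ").size)       // prints 4: the naive split yields an empty token
    println(wordCount(variety))            // prints 3
}
```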

We can also add new columns to the dataframe (`add` returns a new dataframe and leaves `df` unchanged):

In [44]:
val dfPlus = df.add("VarietyWords") { Variety.split(" ").size }
dfPlus

Review #,Brand,Variety,Style,Country,Stars,Top Ten,VarietyWords
2580,New Touch,T's Restaurant Tantanmen,Cup,Japan,3.75,,3
2579,Just Way,Noodles Spicy Hot Sesame Spicy Hot Se...,Pack,Taiwan,1.0,,9
2578,Nissin,Cup Noodles Chicken Vegetable,Cup,USA,2.25,,4
2577,Wei Lih,GGE Ramen Snack Tomato Flavor,Pack,Taiwan,2.75,,5
2576,Ching's Secret,Singapore Curry,Pack,India,3.75,,2
2575,Samyang Foods,Kimchi song Song Ramen,Pack,South Korea,4.75,,4
2574,Acecook,Spice Deli Tantan Men With Cilantro,Cup,Japan,4.0,,6
2573,Ikeda Shoku,Nabeyaki Kitsune Udon,Tray,Japan,3.75,,3
2572,Ripe'n'Dry,Hokkaido Soy Sauce Ramen,Pack,Japan,0.25,,4
2571,KOKA,The Original Spicy Stir-Fried Noodles,Pack,Singapore,2.5,,5


## Kandy

[Official documentation](https://kotlin.github.io/kandy/)

Line plot of the number of words in each variety, by review number:

In [50]:
dfPlus.plot {
    line {
        x(`Review #`)
        y(VarietyWords)
    }
}

Bar chart of the number of reviews per country:

In [38]:
val reviewsPerCountDf = df.groupBy { Country }.aggregate {
    count() into "Count"
}

reviewsPerCountDf

Country,Count
Japan,352
Taiwan,224
USA,323
India,31
South Korea,309
Singapore,109
Thailand,191
Hong Kong,137
Vietnam,108
Ghana,2
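
To make the grouping step above concrete, here is what a count per group computes, written with the Kotlin standard library only and followed by an ascending sort on the count. This is an illustrative sketch, not how the dataframe library implements `groupBy`; `countByCountry` and the sample values are made up.

```kotlin
// Counts how many times each country occurs, then orders the pairs by ascending count.
fun countByCountry(countries: List<String>): List<Pair<String, Int>> =
    countries.groupingBy { it }.eachCount().toList().sortedBy { it.second }

fun main() {
    // Made-up sample values, not taken from the ramen dataset.
    val sample = listOf("Japan", "USA", "Japan", "Taiwan", "Japan", "USA")
    println(countByCountry(sample)) // prints [(Taiwan, 1), (USA, 2), (Japan, 3)]
}
```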


In [39]:
reviewsPerCountDf.sortBy { Count }.plot { bars {
    x(Country)
    y(Count)
} }