_Prepared for the TRAPEGO kick-off meeting on 3 March 2021 by [Ueli Reber](https://uelireber.ch)_

<img src="teaser.png" alt="drawing" width="600"/>

You have probably asked yourself countless times what kind of pesticide you would be if you were one. Well, today is your lucky day: you will find out by learning from data! 🤓

First you will compile some data about yourself, then get and prepare pesticide descriptions, do some text mining wizardry, and finally learn your pesticide. Let's get started!

## Your data

Create your profile by answering the following (extremely scientific) questions. Type your answer in the code field (between the quotation marks) and then click the **► Run** button above.

1) What do you prefer: `"fruits"` or `"vegetables"`?

In [None]:
a_1 <- "TYPE-YOUR-ANSWER-HERE"

2) So, what is your favorite fruit/vegetable then? You can type in more than one (comma-separated).

In [None]:
a_2 <- "TYPE-YOUR-ANSWER-HERE"

3) What menu would you choose for lunch at the canteen?

* `"Beef Stroganoff with chili peppers, onions, mushrooms & fried noodles"`
* `"Pad Thai with beetroot pancakes, soybean sprouts, snow peas, carrots & sweet sour sauce"`
* `"Wild salmon fillet in puff pastry, creamed savoy cabbage & parsley potatoes"`

In [None]:
a_3 <- "TYPE-YOUR-ANSWER-HERE"

4) What spread do you enjoy on your breakfast bread: `"jam"`, `"honey"`, `"nutella"`, just `"butter"`, or are you more of a `"cereal"` type of person?

In [None]:
a_4 <- "TYPE-YOUR-ANSWER-HERE"

5) Finally, clothing: What material is your sweater/T-shirt made of?

* `"Cotton"`
* `"Wool"`
* `"Some synthetic fabric"`

In [None]:
a_5 <- "TYPE-YOUR-ANSWER-HERE"

Okay, enough about you, let's move on to the pesticide descriptions!

## Pesticide descriptions

To determine your pesticide, we need data. In our case, these are the names and descriptions of different pesticides. Go ahead and load the data (along with the packages required below).

In [None]:
# load required packages
library(tidyverse)
library(quanteda)

# load pesticide descriptions
pest_df <- read_csv("data/pesticides.csv", 
                     col_types = cols()) 

# have a look
head(pest_df)

This looks alright! Let's move on then and find out which pesticide is yours.

## Model

To identify your pesticide, we use [nearest neighbor search](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) (also known as k-nearest neighbors or KNN). This is a relatively simple method that is often used to find objects that are similar to each other. Given an input object, nearest neighbor search identifies the objects in your data that are closest to it. This is exactly what we need in order to find the pesticide closest to you!
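
To get a feeling for the idea, here is a minimal sketch of a 1-nearest-neighbor search in base R. The three "foods" and their two made-up scores (`sweet`, `crunchy`) are purely illustrative assumptions and have nothing to do with the pesticide data; the sketch uses Euclidean distance to decide which row is closest.

```r
# toy training data: one row per object, one column per feature (made-up values)
train <- rbind(
  apple  = c(sweet = 9, crunchy = 8),
  lemon  = c(sweet = 2, crunchy = 3),
  carrot = c(sweet = 5, crunchy = 9)
)

# the input object we want to find a neighbor for (also made up)
query <- c(sweet = 8, crunchy = 7)

# Euclidean distance from the query to every row of the training data
dists <- sqrt(rowSums(sweep(train, 2, query)^2))

# the nearest neighbor is the row with the smallest distance
nearest <- names(which.min(dists))
nearest
```

With `k = 1`, the prediction is simply the label of this single closest row. With a larger `k`, the labels of the `k` closest rows would be put to a vote.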

However, before we proceed to modeling, we have to bring your data and the descriptions of the pesticides into the correct form (preprocessing).

In [None]:
# preprocessing of the data (including yours)
pest_dfm <- pest_df %>%
  add_row(doc_id = nrow(pest_df) + 1, name = NA, type = NA, 
          text = paste(a_1, a_2, a_3, a_4, a_5)) %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords(language = "en")) %>%
  tokens_wordstem(language = "en") %>%
  dfm() %>%
  dfm_tfidf()

# get an idea of the data structure
pest_dfm
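
If you are curious what the final `dfm_tfidf()` step does to the counts, the following base R sketch may help. It assumes the default settings of `dfm_tfidf()` (raw counts multiplied by the base-10 logarithm of "number of documents divided by number of documents containing the term"). The tiny count matrix and its terms are invented for illustration only.

```r
# toy document-feature matrix of raw term counts (made-up values)
counts <- rbind(
  doc1 = c(apple = 2, spray = 1, field = 0),
  doc2 = c(apple = 1, spray = 0, field = 3),
  you  = c(apple = 1, spray = 0, field = 0)
)

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)       # number of documents containing each term
idf      <- log10(n_docs / doc_freq)  # inverse document frequency

# multiply every column (term) by its idf weight
tfidf <- sweep(counts, 2, idf, `*`)
round(tfidf, 3)
```

In this toy example, `apple` occurs in every document, so its weight drops to zero: a term that appears everywhere does not help to tell documents apart. Rarer terms such as `field` keep a positive weight.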

Okay, let's get serious now and finally identify your pesticide!

In [None]:
# extract training set, i.e. the descriptions
pest_train <- pest_dfm[-nrow(pest_dfm), ]
# extract testing set, i.e. your data
pest_test <- pest_dfm[nrow(pest_dfm), ]

# extract the target labels from the training data, i.e. pesticide names
pest_target <- pest_train$name

# run knn function, i.e. predict your pesticide
pest_pred <- class::knn(train = pest_train,
                        test = pest_test,
                        cl = pest_target, 
                        k = 1)

Eureka, that's it! Now we just have to print it.

In [None]:
# extract pesticide information
pest_res <- pest_df[pest_df$name == as.character(pest_pred), ]

# learn your pesticide
IRdisplay::display_markdown(paste0("📣 **", pest_res$name, " (", tolower(pest_res$type), ")**: ", pest_res$text))

Brilliant, you have found your pesticide using data science! Congratulations! 👏

![](https://media.giphy.com/media/xUPGcuomRFMUcsB9nO/giphy-downsized.gif)

<sub>**Disclaimer:** This exercise does not contain any science. 😉</sub>