# What pesticide are you? Find out with data science!

Author: [Ueli Reber](https://uelireber.ch)  
Version: 2021-02-25

You have probably long wondered what kind of pesticide you would be if you were one. Well, today is your lucky day: you will find out by learning from data! 😃

First you will compile some data about yourself, then load and prepare pesticide descriptions, do some text mining, and ultimately identify your pesticide. Let's get started!

## Your data

Create your profile by answering the following (extremely scientific) questions. Type your answer in the code field (between the quotation marks) and then click the **► Run** button above.

1) What do you prefer: `"fruits"` or `"vegetables"`?

In [16]:
a_1 <- "TYPE-YOUR-ANSWER-HERE"

2) So, what is your favorite fruit/vegetable then? You can type in more than one (comma separated).

In [17]:
a_2 <- "TYPE-YOUR-ANSWER-HERE"

3) Which menu would you choose for lunch at the canteen?

* `"Beef Stroganoff with chili peppers, onions, mushrooms & fried noodles"`
* `"Pad Thai with beetroot pancakes, soybean sprouts, snow peas, carrots & sweet sour sauce"`
* `"Wild salmon fillet in puff pastry, creamed savoy cabbage & parsley potatoes"`

In [18]:
a_3 <- "TYPE-YOUR-ANSWER-HERE"

4) What spread do you enjoy on your breakfast bread: `"jam"`, `"honey"`, `"nutella"`, just `"butter"`, or are you more of a `"cereal"` type of person?

In [19]:
a_4 <- "TYPE-YOUR-ANSWER-HERE"

5) Finally, clothing: What material is your sweater/T-shirt made of?

* `"Cotton"`
* `"Wool"`
* `"Some synthetic fabric"`

In [20]:
a_5 <- "TYPE-YOUR-ANSWER-HERE"

Okay, enough about you, let's move on to the pesticide descriptions!

## Pesticide descriptions

To determine your pesticide, we need data. In our case, these are the names and descriptions of different pesticides. Go ahead and load the data (along with the packages required below).

In [22]:
# load required packages
library(tidyverse, quietly = TRUE)
library(quanteda, quietly = TRUE)

# load pesticide descriptions
pest_df <- read_csv("data/pesticides.csv", 
                     col_types = cols()) 

# have a look
head(pest_df)

doc_id,name,type,text
1,Azoxystrobin,Fungicide,"An experimental compound used on cereals, vegetables, fruit crops, peanuts, turf, ornamentals, stone fruit, bananas, rice, apples, grapes, & potatoes. This chemical does not leach and is unlikely to contaminate water bodies. It is found to exhibit very low ecological risks, to aquatic life, birds, and mammals. Other names include Abound, Amistar, Bankit, Heritage, and Quadris."
2,Boscalid,Fungicide,"Fungicide used on specialty crops such as straberries, beans, stone fruit, tree nuts, root vegetables, carrots, grapes, Brassica vegetables, and sunflowers."
3,Carbendazim (MBC),Fungicide,"Found to be acutely toxic to honeybees, having an effect on long term survival of colonies. Foods with Carbendazim residues include: strawberries, green beans, apple sauce, blueberries, sweet bell peppers, apples, cherries, green onions, spinach, bananas, honey, lettuce, water, celery, cauliflower, celery & broccoli."
4,Chlorothalonil,Fungicide,"General use insecticide used on trees, small fruits, turf, ornamentals, and vegetables. Found to be non-toxic to honey bees."
5,Cyprodinil,Fungicide,"Used as a foliar fungicide on cereals, grapes, pome fruit, stone fruit, strawberries, vegetables, field crops and ornamentals; and as a seed dressing on barley."
6,Dicloran,Fungicide,"Widely used fungicide used on a variety of ornamentals, fruit and vegetable crops such as pricots, snap beans, carrots, celery, cherries, cucumber, endive, fennel, garlic, grapes, lettuce, nectarines, onions, peaches, plums, potatoes, prunes, rhubarb, shallots, sweet potatoes and tomatoes."


This looks alright! Let's move on then and find out which pesticide is yours.

## Model

To identify your pesticide we use [nearest neighbor search](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) (also known as k-nearest neighbors or KNN). This is a relatively simple method that is often used to find objects that are similar to each other. Given an input object, it identifies the objects in your data that are closest to it. This is exactly what we need in order to find the pesticide closest to you!
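To make the idea concrete, here is a minimal sketch of a nearest neighbor lookup in base R. The objects, feature values, and variable names are made up for illustration, and Euclidean distance is assumed as the measure of closeness.

```r
# Toy data: three "known" objects described by two numeric features.
# All names and numbers here are made up for illustration.
known <- rbind(
  apple  = c(1, 0),
  carrot = c(0, 1),
  grape  = c(2, 0)
)
query <- c(0.9, 0.2)

# Euclidean distance from the query to every known object
# (sweep() subtracts the query from each row)
dists <- sqrt(rowSums(sweep(known, 2, query)^2))

# The nearest neighbor is the object with the smallest distance
nearest <- names(which.min(dists))
nearest
```

With `k = 1`, the prediction is simply the label of that single closest object.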

However, before we proceed to modeling, we have to bring your data and the pesticide descriptions into the right format (preprocessing): tokenizing, lowercasing, removing stopwords, stemming, and weighting the resulting term counts by tf-idf.

In [39]:
# preprocessing of the data (including yours)
pest_dfm <- pest_df %>%
  add_row(doc_id = nrow(pest_df) + 1, name = NA, type = NA, 
          text = paste(a_1, a_2, a_3, a_4, a_5)) %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords(language = "en")) %>%
  tokens_wordstem(language = "en") %>%
  dfm() %>%
  dfm_tfidf()

# get an idea of the data structure
pest_dfm

Document-feature matrix of: 62 documents, 687 features (96.5% sparse) and 2 docvars.
    features
docs experiment compound       use   cereal     veget     fruit      crop
   1   1.792392 1.491362 0.1021956 1.190332 0.5882717 0.8243609 0.3610279
   2   0        0        0.1021956 0        1.1765434 0.4121804 0.3610279
   3   0        0        0         0        0         0         0        
   4   0        0        0.2043912 0        0.5882717 0.4121804 0        
   5   0        0        0.1021956 1.190332 0.5882717 0.8243609 0.3610279
   6   0        0        0.2043912 0        0.5882717 0.4121804 0.3610279
    features
docs  peanut     turf  ornament
   1 1.31527 1.093422 0.6163004
   2 0       0        0        
   3 0       0        0        
   4 0       1.093422 0.6163004
   5 0       0        0.6163004
   6 0       0        0.6163004
[ reached max_ndoc ... 56 more documents, reached max_nfeat ... 677 more features ]
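The weights in this matrix appear consistent with a tf-idf of the form count × log10(N / document frequency): for example, 1.792392 equals log10(62), which is what a term occurring once in exactly one of the 62 documents would get. The sketch below reproduces that arithmetic in base R on a tiny made-up corpus. It is an illustration under that assumption, not the internals of `dfm_tfidf()`.

```r
# Tiny made-up corpus, already tokenized: one character vector per document
docs <- list(
  d1 = c("fruit", "fruit", "cereal"),
  d2 = c("fruit", "turf"),
  d3 = c("turf", "honey"),
  d4 = c("fruit")
)

vocab <- sort(unique(unlist(docs)))

# Term frequency: raw counts, documents in rows and terms in columns
tf <- t(sapply(docs, function(d) tabulate(match(d, vocab), nbins = length(vocab))))
colnames(tf) <- vocab

# Document frequency: in how many documents each term occurs
doc_freq <- colSums(tf > 0)

# Inverse document frequency with a base-10 logarithm
idf <- log10(length(docs) / doc_freq)

# tf-idf: scale every column of counts by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Terms that occur in many documents (here `fruit`) get a small weight, while rare terms (here `cereal` and `honey`) get a large one, which is why a common stem like `use` carries so little weight in the output above.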

Okay, let's get serious now and finally identify your pesticide!

In [40]:
# extract training set, i.e. the descriptions
pest_train <- pest_dfm[-nrow(pest_dfm), ]
# extract testing set, i.e. your data
pest_test <- pest_dfm[nrow(pest_dfm), ]

# extract the prediction target from the training data, i.e. the pesticide names
pest_target <- pest_train$name

# run knn function, i.e. predict your pesticide
pest_pred <- class::knn(train = pest_train,
                        test = pest_test,
                        cl = pest_target, 
                        k = 1)
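If the arguments of `class::knn()` are unfamiliar, the following toy call may help. The matrices and labels are invented for illustration; the point is only to show the expected shapes (one row per document, one label per training row) and that the result comes back as a factor.

```r
# Made-up training data: rows are documents, columns are numeric features
train <- rbind(c(1, 0), c(0, 1), c(2, 2))
labels <- c("Alpha", "Beta", "Gamma")   # hypothetical class labels
test <- rbind(c(0.1, 0.9))              # one query row, same columns as train

pred <- class::knn(train = train, test = test, cl = labels, k = 1)

# The prediction is a factor whose levels are all labels seen in training
levels(pred)

# Convert to a plain string before comparing or printing
pred_name <- as.character(pred)
pred_name
```

This is why the prediction is wrapped in `as.character()` before it is matched against the pesticide names.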

Eureka, that's it! Now we just have to print it.

In [42]:
# extract pesticide information
pest_res <- pest_df[pest_df$name == as.character(pest_pred), ]

# learn your pesticide
IRdisplay::display_markdown(paste0("📣 **", pest_res$name, " (", tolower(pest_res$type), ")**: ", pest_res$text, " 😉"))

📣 **Fenpropathrin (insecticide)**: Insecticide used in agriculture and on ornamentals. Used to control mites in fruits and vegetables. 😉