# Project Proposal - Group 4

By Tianne Lee, Steven Tsai, Chloe Zhang, Jinghan Xu

## 1. **Introduction**

From flat, elliptical leaves to spiky, needle-shaped ones, plant leaves have evolved distinct shapes and sizes to adapt to the wide range of living conditions Earth has to offer. However, some leaves still look similar despite coming from very different plant species. In our project, we compare the leaves of two plant species, *Betula pubescens* (class 9 in the dataset) and *Tilia tomentosa* (class 10 in the dataset).

<p align="center">
<img src="data/leaf9_leaf10.png" alt="image source: ReadMe.pdf" width="40%"/>
<figcaption align = "center">Figure 1.1 - photo of leaves of Betula pubescens (left) and Tilia tomentosa (right)</figcaption>
</p>

In Figure 1.1, we can see that the two leaves share similar features. The goal of our study is to find out whether the mean smoothness<sup>[1](#myfootnote1)</sup> and mean solidity of Tilia tomentosa (class 10) are larger than those of Betula pubescens (class 9).

We will use a leaf dataset retrieved from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Leaf). This dataset contains 16 shape and texture features of 40 plant species. We will focus only on the Tilia tomentosa and Betula pubescens species, and on the solidity and smoothness features.

<a name="myfootnote1">1</a>: Smoothness ranges between 0 and 1, with the value increasing as roughness increases.

## 2. **Methods and Results**

In [None]:
library("tidyverse")
library("dplyr")
library("broom")
library("infer")

### 2.1 Preliminary Results

Read dataset and add column names for readability:

In [None]:
raw_leaf <- read_csv("data/leaf.csv", show_col_types = FALSE, col_names = c(
    "Class", "Specimen_Number", "Eccentricity", "Aspect_Ratio",
    "Elongation", "Solidity", "Stochastic_Convexity", "Isoperimetric_Factor",
    "Maximal_Indentation_Depth", "Lobedness", "Average_Intensity", "Average_Contrast", # nolint
    "Smoothness", "Third_moment", "Uniformity", "Entropy"))
head(raw_leaf)

<figcaption align = "center">Table 2.1.1 - first 6 rows of the raw data set with readable column names</figcaption>
</p>

Clean and wrangle the data set. Drop irrelevant columns and rows. Keep `Solidity` and `Smoothness` as the response variables and extract the rows of class 9 and class 10. The summary of the wrangled data set is presented in Table 2.1.2.

In [None]:
leaf <- raw_leaf %>% 
    mutate(Class = as.factor(Class)) %>% 
    select(Class, Solidity, Smoothness) %>% 
    filter(Class == 9 | Class == 10)
summary(leaf)

<figcaption align = "center">Table 2.1.2 - Summary of the cleaned data set</figcaption>
</p>

Compute the mean and standard deviation of Solidity:

In [None]:
solidity_estimates <- leaf %>%
    group_by(Class) %>%
    summarise(
        mean = mean(Solidity),
        sd = sd(Solidity)
    )
solidity_estimates

<figcaption align = "center">Table 2.1.3 - Mean and standard deviation of Solidity for Class 9 and 10</figcaption>
</p>

Compute the mean and standard deviation of Smoothness:

In [None]:
smoothness_estimates <- leaf %>%
    group_by(Class) %>%
    summarise(
        mean = mean(Smoothness),
        sd = sd(Smoothness)
    )
smoothness_estimates

<figcaption align = "center">Table 2.1.4 - Mean and standard deviation of Smoothness for Class 9 and 10</figcaption>
</p>

Plot the distribution of solidity and smoothness of the two categories:

In [None]:
solidity_dist <- leaf |>
    ggplot(aes(x = Solidity, fill = Class)) +
    geom_histogram(binwidth = 0.01, alpha = 0.4) +
    ggtitle("Distribution of Solidity") +
    geom_vline(data = solidity_estimates, aes(xintercept = mean, color = Class))

smoothness_dist <- leaf |>
    ggplot(aes(x = Smoothness, fill = Class)) +
    geom_histogram(binwidth = 0.01, alpha = 0.4) +
    ggtitle("Distribution of Smoothness") +
    geom_vline(data = smoothness_estimates, aes(xintercept = mean, color = Class))

In [None]:
solidity_dist

<figcaption align = "center">Figure 2.1.1 - Sample distribution of Solidity for Class 9 and 10 </figcaption>
</p>

In [None]:
smoothness_dist

<figcaption align = "center">Figure 2.1.2 - Sample distribution of Smoothness for Class 9 and 10 </figcaption>
</p>

### 2.2 Methods

**Reservations about the plots and estimates**: The sample sizes of the two classes we are interested in are 14 and 13, which is not large enough to reveal clear numerical patterns or to give a convincing result on its own. The variances of the two samples are also large, which makes the estimates less precise.

**Next, we adjust our approach to the dataset**. To make the most of the limited observations, we will conduct a resampling-based analysis. By bootstrap sampling the original data (with resample sizes of 14 and 13, respectively, and 500 repeats) and by using asymptotic methods, we aim to quantify the uncertainty in our preliminary estimates.
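
The resampling step described above can be sketched in base R. This is a minimal illustration, not the project's analysis: the `toy` data frame and its values are simulated placeholders (group sizes 14 and 13 and 500 replicates are taken from the text), and `boot_diff` is a hypothetical helper name.

```r
set.seed(1234)

# Simulated placeholder data (NOT the leaf data): two groups of size 14 and 13.
toy <- data.frame(
    Class = factor(rep(c("9", "10"), times = c(14, 13))),
    Solidity = c(rnorm(14, mean = 0.85, sd = 0.05), rnorm(13, mean = 0.92, sd = 0.03))
)

# Hypothetical helper: resample each class with replacement (keeping its
# original size) and return the difference in means, class 10 minus class 9.
boot_diff <- function(data) {
    x10 <- data$Solidity[data$Class == "10"]
    x9 <- data$Solidity[data$Class == "9"]
    mean(sample(x10, replace = TRUE)) - mean(sample(x9, replace = TRUE))
}

# 500 bootstrap replicates of the difference in means.
boot_diffs <- replicate(500, boot_diff(toy))

# The spread of the replicates estimates the standard error of the difference.
boot_se <- sd(boot_diffs)
boot_se
```

Resampling of this kind estimates how much the statistic varies from sample to sample. It does not add information beyond the original observations.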

Now, let's test our preliminary expectation that Tilia tomentosa (class 10) has greater mean solidity and smoothness than Betula pubescens (class 9). The **hypotheses** are as follows (we set the significance level for all hypothesis tests to 0.05, and we will find a 90% confidence interval for each):
-  $H_{10}$: the mean solidity of Tilia tomentosa (class 10) is the same as that of Betula pubescens (class 9).
-  $H_{1A}$: the mean solidity of Tilia tomentosa (class 10) is greater than that of Betula pubescens (class 9).

-  $H_{20}$: the mean smoothness of Tilia tomentosa (class 10) is the same as that of Betula pubescens (class 9).
-  $H_{2A}$: the mean smoothness of Tilia tomentosa (class 10) is greater than that of Betula pubescens (class 9).
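
In symbols, writing $\mu_{10}$ and $\mu_{9}$ for the population means of a feature (solidity or smoothness) in class 10 and class 9, both pairs of hypotheses have the same one-sided form. The $\mu$ notation is introduced here for illustration and does not come from the dataset documentation.

$$
H_0: \mu_{10} - \mu_{9} = 0 \qquad \text{vs.} \qquad H_A: \mu_{10} - \mu_{9} > 0
$$

The order of subtraction (class 10 minus class 9) matches `order = c("10", "9")` and `direction = "right"` in the code cells.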




We are comparing the solidity and smoothness of two types of leaves, so we test the difference between two independent means (equivalently, a two-sample t-test). We run the test first for solidity and then for smoothness.
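
As a sketch of what a two-sample t-test computes: R's `t.test()` uses the Welch (unequal variance) form by default, with statistic $t = (\bar{x}_{10} - \bar{x}_{9}) / \sqrt{s_{10}^2/n_{10} + s_{9}^2/n_{9}}$. The example below computes this by hand on simulated placeholder values (not the leaf data) and compares it with `t.test()`. The names `x10` and `x9` are illustrative.

```r
set.seed(1234)

# Simulated placeholder samples (NOT the leaf data).
x10 <- rnorm(13, mean = 0.92, sd = 0.03)
x9 <- rnorm(14, mean = 0.85, sd = 0.05)

# Per-group variance of the sample mean.
v10 <- var(x10) / length(x10)
v9 <- var(x9) / length(x9)

# Welch t statistic: difference in sample means over its standard error.
t_manual <- (mean(x10) - mean(x9)) / sqrt(v10 + v9)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (v10 + v9)^2 / (v10^2 / (length(x10) - 1) + v9^2 / (length(x9) - 1))

# One-sided p-value for H_A: mean of class 10 is greater than mean of class 9.
p_manual <- pt(t_manual, df = df_manual, lower.tail = FALSE)

# The built-in test should agree with the manual calculation.
res <- t.test(x = x10, y = x9, alternative = "greater")
c(manual = p_manual, t_test = res$p.value)
```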

In [None]:
solidity_mean_diff  <- leaf %>% 
    specify(formula = Solidity ~ Class) %>% 
    calculate(stat = "diff in means", order = c("10", "9"))

solidity_mean_diff

smoothness_mean_diff  <- leaf %>% 
    specify(formula = Smoothness ~ Class) %>% 
    calculate(stat = "diff in means", order = c("10", "9"))

smoothness_mean_diff

In [None]:
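set.seed(1234)

# Note: despite the `bootstrap_` prefix, these are permutation (null)
# distributions: the class labels are shuffled 10,000 times.

# Null distribution of the difference in mean solidity (class 10 minus class 9).
bootstrap_solidity <- 
     leaf %>% 
     specify(formula = Solidity ~ Class) %>% 
     hypothesize(null = "independence") %>% 
     generate(type = "permute", reps = 10000) %>% 
     calculate(stat = "diff in means", order = c("10", "9"))

head(bootstrap_solidity)

# Null distribution of the difference in mean smoothness (class 10 minus class 9).
bootstrap_smoothness <- 
     leaf %>% 
     specify(formula = Smoothness ~ Class) %>% 
     hypothesize(null = "independence") %>% 
     generate(type = "permute", reps = 10000) %>% 
     calculate(stat = "diff in means", order = c("10", "9"))

head(bootstrap_smoothness)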

In [None]:
p_value_solidity <- bootstrap_solidity %>%
    get_p_value(obs_stat = solidity_mean_diff, direction = "right")
p_value_solidity # p value < 0.01

# Compare the smoothness statistic against the smoothness null distribution.
p_value_smoothness <- bootstrap_smoothness %>%
    get_p_value(obs_stat = smoothness_mean_diff, direction = "right")
p_value_smoothness # p value < 0.01
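
To make the p-value step concrete, here is a base R sketch of a one-sided permutation test: shuffle the class labels, recompute the difference in means, and report the share of shuffled differences at least as large as the observed one. The data are simulated placeholders, and `diff_in_means` is a hypothetical helper, not part of `infer`.

```r
set.seed(1234)

# Simulated placeholder data (NOT the leaf data).
label <- rep(c("9", "10"), times = c(14, 13))
value <- c(rnorm(14, mean = 0.85, sd = 0.05), rnorm(13, mean = 0.92, sd = 0.03))

# Hypothetical helper: mean of class 10 minus mean of class 9.
diff_in_means <- function(value, label) {
    mean(value[label == "10"]) - mean(value[label == "9"])
}

obs_diff <- diff_in_means(value, label)

# Null distribution: shuffle the labels so that class carries no information.
null_diffs <- replicate(10000, diff_in_means(value, sample(label)))

# Right-tailed p-value: share of shuffled differences >= the observed one.
p_value <- mean(null_diffs >= obs_diff)
p_value
```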

In [None]:
shade_pva_solidity <-
    bootstrap_solidity %>%
    visualize() +
    shade_p_value(solidity_mean_diff, direction = "right")
shade_pva_solidity

shade_pva_smoothness <-
    bootstrap_smoothness %>%
    visualize() +
    shade_p_value(smoothness_mean_diff, direction = "right")
shade_pva_smoothness

In [None]:
solidity_ci_0.9 <- bootstrap_solidity %>%
    get_confidence_interval(level = 0.9)
solidity_ci_0.9

smoothness_ci_0.9 <- bootstrap_smoothness %>%
    get_confidence_interval(level = 0.9)
smoothness_ci_0.9

In [None]:
shade_ci_solidity <-
    bootstrap_solidity %>%
    visualize() +
    shade_confidence_interval(endpoints = solidity_ci_0.9)
shade_ci_solidity

shade_ci_smoothness <-
    bootstrap_smoothness %>%
    visualize() +
    shade_confidence_interval(endpoints = smoothness_ci_0.9)
shade_ci_smoothness
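
The distributions used above are generated with `type = "permute"`, so they are centered near zero under the null hypothesis. A confidence interval for the difference in means is more commonly taken from a bootstrap distribution instead. The following is a hedged sketch of that alternative using `infer` on simulated placeholder data. The `toy` data frame and its values are assumptions, not the leaf data.

```r
library(infer)

set.seed(1234)

# Simulated placeholder data (NOT the leaf data).
toy <- data.frame(
    Class = factor(rep(c("9", "10"), times = c(14, 13))),
    Solidity = c(rnorm(14, mean = 0.85, sd = 0.05), rnorm(13, mean = 0.92, sd = 0.03))
)

# Bootstrap distribution of the difference in means (class 10 minus class 9).
# There is no hypothesize() step: rows are resampled with replacement.
boot_dist <- toy |>
    specify(formula = Solidity ~ Class) |>
    generate(reps = 1000, type = "bootstrap") |>
    calculate(stat = "diff in means", order = c("10", "9"))

# 90% percentile interval for the difference in means.
boot_ci <- get_confidence_interval(boot_dist, level = 0.9, type = "percentile")
boot_ci
```

If such an interval lies entirely above zero, that is consistent with rejecting the null hypothesis in the one-sided test.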

## 3. **Discussion**



**On trustworthiness**: because we resample many times, the simulation error in the resampling distribution is small, so the findings should be more reliable than judging directly from plots of the limited observations. Resampling cannot, however, remove the sampling variability that comes from having only 27 leaves. We set the significance level to $\alpha$ = 0.05 to limit the probability of rejecting the null hypothesis when it is true.

**Future question**: having compared the smoothness and solidity of the two species, we could ask whether there is any evidence that smoothness leads to solidity, or vice versa.

In [None]:
t_test_solidity <- tidy(
    t.test(
        x = leaf %>% filter(Class==10) %>% pull(Solidity),
        y = leaf %>% filter(Class==9) %>% pull(Solidity),
        alternative = 'greater'
    )
)
t_test_solidity

In [None]:
t_test_smoothness <- tidy(
    t.test(
        x = leaf %>% filter(Class==10) %>% pull(Smoothness),
        y = leaf %>% filter(Class==9) %>% pull(Smoothness),
        alternative = 'greater'
    )
)
t_test_smoothness

From the solidity t-test, we obtained a p-value of 0.001049488, which is smaller than our $\alpha$ value of 0.05. Therefore, we reject the null hypothesis and conclude that the mean solidity of Tilia tomentosa (class 10) is greater than that of Betula pubescens (class 9).

From the smoothness t-test, we obtained a p-value of 3.660258e-05, which is smaller than our $\alpha$ value of 0.05. Therefore, we reject the null hypothesis and conclude that the mean smoothness of Tilia tomentosa (class 10) is greater than that of Betula pubescens (class 9).

## 4. **References**

"Evaluation of Features for Leaf Discrimination”, Pedro F. B. Silva, Andre R.S. Marcal,
Rubim M. Almeida da Silva (2013), Springer Lecture Notes in Computer Science, Vol.
7950, 197-204.

“Development of a System for Automatic Plant Species Recognition”, Pedro Filipe Silva,
Dissertação de Mestrado (Master’s Thesis), Faculdade de Ciências da Universidade do
Porto. Available for download or online reading at http://hdl.handle.net/10216/67734
