# DSCI 100 Project Proposal

## Introduction

### Background

The goal of our project is to distinguish man-made pollutants on the seafloor from natural objects using sonar data. More specifically, we will compare sonar returns from metal cylinders (also referred to as mines) with sonar returns from rocks. With accurate classification, more can be learned about the objects within our oceans simply by relying on sonar data.

### Guiding Question

How can we classify an observation from sonar data to see whether it is a rock or a metal cylinder?

### Dataset

We will train the model using the Connectionist Bench (Sonar, Mines vs. Rocks) Data Set from the University of California, Irvine (UCI) Machine Learning Repository. The data contain 111 sonar patterns from mines and 97 patterns from rocks, with signals recorded at different angles. This will allow us to build a classifier that differentiates between the two kinds of object.

Link to the dataset: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
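
The class counts quoted above give a simple reference point for any classifier. The following is a minimal sketch, using only the two counts stated in this section, of the accuracy obtained by always guessing the more common class. The variable names are ours and are only illustrative.

```r
# Class counts as stated in the dataset description above
n_mines <- 111
n_rocks <- 97
n_total <- n_mines + n_rocks

# Share of each class in the full dataset
class_share <- c(mine = n_mines, rock = n_rocks) / n_total

# Accuracy of a classifier that always predicts the majority class
majority_baseline <- max(class_share)

round(class_share, 3)
round(majority_baseline, 3)
```

A model trained on the sonar data should do noticeably better than this baseline to be considered useful.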

## Preliminary exploratory data analysis 

### Loading libraries

To process the data, we will be using the `tidyverse`, `repr`, and `rvest` packages, along with `cowplot` for arranging plots side by side.

In [4]:
library(tidyverse)
library(repr)
library(rvest)
library(cowplot)


### Reading the data

We will read the data directly from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php).

The last 10 rows are shown as a preview of the current state of the data.

It is worth noting that there are 60 numeric columns, each denoting the energy within a particular frequency band over a period of time, plus a 61st column holding the class label. Each numeric value lies between 0 and 1.

We also preprocess the data by moving the class label column from the end to the first position. This makes it clearer what each observation represents and simplifies the tidying stage.

In [7]:
sonar_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"

# Read the file once; it has no header row, so readr names the columns X1 to X61
data <- read_csv(sonar_url, col_names = FALSE)

# Move the class label (X61) to the first column and store it as a factor
sonar_data <- data %>%
    rename(categories = X61) %>%
    relocate(categories) %>%
    mutate(categories = as_factor(categories))

tail(sonar_data, 10) # only the last 10 rows are shown as a preview


### Tidying the data

In [8]:
# Build the tidy table from the raw `data` read above:
# class label first (as a factor), followed by the 60 energy columns
sonar_data <- data %>%
    rename(categories = X61) %>%
    relocate(categories) %>%
    mutate(categories = as_factor(categories))


### Finding the mean
We plan to use mean energy as a predictor for classification. For each observation we compute the mean over columns X1 to X30 (`mean1`), the mean over columns X31 to X60 (`mean2`), and the mean over all 60 columns (`mean`). We will also make a second plot that uses individual columns from X1:X60.

In [9]:
average_data <- sonar_data %>%
    rowwise() %>%
    # one mean per observation (row), placed right after the class label
    mutate(mean = mean(c_across(X1:X60)),
           mean1 = mean(c_across(X1:X30)),
           mean2 = mean(c_across(X31:X60)),
           .after = categories) %>%
    ungroup() # drop the row-wise grouping once the means are computed

tail(average_data, 5) # previewing the computed means
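
To make the row-wise means concrete, here is a small sketch on a made-up table with only four columns. The values are invented for illustration and are not taken from the sonar data. It uses base R's `rowMeans()` rather than `c_across()` so that it runs without extra packages; the split into a "first half" and a "second half" mirrors the X1:X30 and X31:X60 split above.

```r
# Toy table with invented values: two observations, four energy columns
toy <- data.frame(
    X1 = c(0.1, 0.3), X2 = c(0.3, 0.5),
    X3 = c(0.2, 0.4), X4 = c(0.6, 0.8)
)

first_half  <- c("X1", "X2")  # stands in for X1:X30
second_half <- c("X3", "X4")  # stands in for X31:X60

# One mean per row for each group of columns
toy$mean1 <- rowMeans(toy[, first_half])
toy$mean2 <- rowMeans(toy[, second_half])
toy$mean  <- rowMeans(toy[, c(first_half, second_half)])

toy
```

For the first row, `mean1` is the average of 0.1 and 0.3, and `mean2` is the average of 0.2 and 0.6, so each observation is reduced to a pair of summary values that can be plotted against each other.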


## Plot

We will compare and classify based on two plots. The first uses columns X1 and X2 directly. The second uses the mean energy of the first 30 columns (X1 to X30) against the mean energy of the next 30 columns (X31 to X60). Using the two plots, we will compare how accurate the K-nearest neighbours (KNN) model is with each pair of predictors, to get a better idea of whether to use the means or individual columns.

In [10]:
options(repr.plot.width = 10, repr.plot.height = 8)

# Mean energy of columns X31 to X60 against mean energy of columns X1 to X30
average_plot <- ggplot(average_data, aes(x = mean1, y = mean2, colour = categories)) +
    geom_point(alpha = 0.7) +
    labs(x = "Mean energy, X1 to X30 (mean1)",
         y = "Mean energy, X31 to X60 (mean2)",
         colour = "Category") +
    ggtitle("Mean energy: X31 to X60 vs X1 to X30") +
    theme(text = element_text(size = 20))


In [11]:
# Energy in column X2 against energy in column X1, coloured by class
sonar_data_plot <- sonar_data %>%
    ggplot(aes(x = X1,
               y = X2,
               colour = categories)) +
    geom_point(size = 2.5) +
    labs(x = "Energy in X1",
         y = "Energy in X2",
         colour = "Category") +
    ggtitle("X2 energy vs X1 energy") +
    theme(text = element_text(size = 20))


In [5]:
options(repr.plot.width = 20, repr.plot.height = 8)

plot_grid(average_plot, sonar_data_plot)



## Method
To conduct our analysis, we will look at the different columns (X1, X2, and so on) and compare their values, in order to identify what values are given off by each kind of object when the sonar apparatus is used at different angles. We will use scatterplots to visualise the predictors, and then use classification methods to predict whether a new observation is a rock or a mine (metallic object). We have split the columns into two groups, one containing X1 to X30 and the other X31 to X60, in order to determine whether the energy given off differs between the earlier and later columns.
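
The comparison described above can be sketched as follows. This is only an illustration under several assumptions: the data are simulated (not the real sonar data), the KNN implementation is `knn.cv()` from the `class` package rather than whatever workflow the final project uses, and the choice of leave-one-out cross-validation and `k = 5` is ours. The helper `loo_accuracy()` is a hypothetical name introduced for this sketch.

```r
library(class) # provides knn.cv(), leave-one-out cross-validated KNN

set.seed(1) # make the simulated data reproducible

# Simulated stand-in for the sonar table: NOT the real data
n <- 100
labels <- factor(rep(c("M", "R"), each = n / 2))
shift <- ifelse(labels == "M", 0.15, 0) # hypothetical class separation
toy <- data.frame(
    X1 = runif(n, 0, 0.5),
    X2 = runif(n, 0, 0.5),
    mean1 = runif(n, 0.2, 0.5) + shift,
    mean2 = runif(n, 0.2, 0.5) + shift
)

# Leave-one-out accuracy of KNN for a chosen pair of predictor columns
loo_accuracy <- function(data, predictors, labels, k = 5) {
    x <- scale(data[, predictors]) # standardise so both predictors count equally
    predicted <- knn.cv(train = x, cl = labels, k = k)
    mean(predicted == labels)
}

acc_columns <- loo_accuracy(toy, c("X1", "X2"), labels)
acc_means <- loo_accuracy(toy, c("mean1", "mean2"), labels)

c(columns = acc_columns, means = acc_means)
```

On the real data, the same two calls would use the columns of `sonar_data` and `average_data`, and whichever pair of predictors gives the higher accuracy would be the better basis for the classifier.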

## Expected Outcomes and Significance
After conducting our project, we expect our analysis to be able to predict the types of objects that are under water. Scientists could use these predictions to map the seafloor for sailors, divers, fishers, the military, and others. Our analysis could help scientists identify what objects lie below from sonar data alone, which saves time and helps in planning safe routes. It would also help with preparation: for example, fishers could use the data to identify the most beneficial places to set up their farms. This research could be used in the future by many people and could benefit many groups, such as submarine companies, scientists searching for valuable materials, or sailors navigating their routes. This study could eventually lead people to ask whether a more advanced data analysis could be developed to estimate the size, material, and likely shape of an object.
