**Group Project Proposal: Pulsar Star Classification**

**Introduction**

Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.

Clearly state the question you will try to answer with your project.

Identify and describe the dataset that will be used to answer the question.

Preliminary exploratory data analysis:

Demonstrate that the dataset can be read from the web into R.

Clean and wrangle your data into a tidy format.

Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 

Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

In [None]:
## install packages
#install.packages("tidyverse")


In [3]:
## load packages
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0
✔ modeldata    1.0.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.0     ✔ yardstick    1.0.0
✔ recipes      1.0.1

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()

In [47]:


## LOADING DATA ##

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip"
#Create temporary file to store zip file and download zip file
temp <- tempfile()
download.file(url, temp)

#read dataset (HTRU_2.csv) from zip file (temp) + give each column appropriate name
pulsar_data <- read_csv(unz(temp, "HTRU_2.csv"), col_names = c("Mean integrated profile", 
                                                         "Standard deviation integrated profile", 
                                                         "Excess kurtosis integrated profile", 
                                                         "Skewness integrated profile",
                                                         "Mean DM-SNR curve",
                                                         "Standard deviation DM-SNR curve",
                                                         "Excess kurtosis DM-SNR curve",
                                                         "Skewness DM-SNR curve",
                                                         "Class"))
#delete temporary file (because no longer needed)
unlink(temp)


## CLEANING & WRANGLING DATA INTO A TIDY FORMAT ##

# convert the Class column (which records whether an observation is a pulsar or not) to a factor
pulsar_data <- pulsar_data %>% 
        mutate(Class = as_factor(Class))

# replace spaces and hyphens in column names with dots so they are syntactically valid R names
colnames(pulsar_data) <- make.names(colnames(pulsar_data))

## EXPLORATORY DATA ANALYSIS ##

# splitting data into training and testing data

pulsar_split <- initial_split(pulsar_data, prop = .75, strata = Class)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split)

## summarize training data

# number of observations in each class
pulsar_train_summary <- pulsar_train %>% 
  group_by(Class) %>% 
  summarize(n = n())

pulsar_train_summary

# means of the predictor variables
pulsar_predictor_means <- pulsar_train %>% 
  select(-Class) %>% 
  summarize(across(everything(), mean))

pulsar_predictor_means

# number of rows with missing data

# keep only rows where at least one column is NA (if_any() replaces the superseded filter_at())
pulsar_missing_data <- pulsar_train %>% 
    filter(if_any(everything(), is.na))

number_of_rows_missing <- nrow(pulsar_missing_data)
number_of_rows_missing  # there are no missing data in this dataset





Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Mean integrated profile, Standard deviation integrated profile, Exc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Class,n
<fct>,<int>
0,12192
1,1231


Mean.integrated.profile,Standard.deviation.integrated.profile,Excess.kurtosis.integrated.profile,Skewness.integrated.profile,Mean.DM.SNR.curve,Standard.deviation.DM.SNR.curve,Excess.kurtosis.DM.SNR.curve,Skewness.DM.SNR.curve
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
110.958,46.52722,0.4827907,1.806578,12.5644,26.34006,8.299918,104.8452
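
The exploratory analysis asks for at least one plot of the training data that compares the predictor distributions, and suggests a table combining class counts, predictor means, and missing-row counts. A minimal sketch of both follows, assuming the `pulsar_train` tibble and its `Class` factor column from the cell above. The helper names `summarize_by_class()` and `plot_predictor_distributions()` are hypothetical and are not part of the original notebook.

```r
library(tidyverse)

# Hypothetical helper: one row per class with the class size, the number of
# rows containing at least one NA, and the mean of every numeric predictor.
summarize_by_class <- function(train_data, class_col = "Class") {
  train_data %>%
    mutate(row_has_na = if_any(everything(), is.na)) %>%
    group_by(across(all_of(class_col))) %>%
    summarize(
      # compute the means first so across() does not also average `n`
      # or `rows_with_na`, which are created afterwards
      across(where(is.numeric), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
      n = n(),
      rows_with_na = sum(row_has_na),
      .groups = "drop"
    ) %>%
    relocate(n, rows_with_na, .after = all_of(class_col))
}

# Hypothetical helper: overlaid histograms of every numeric predictor,
# one panel per predictor, coloured by class.
plot_predictor_distributions <- function(train_data, class_col = "Class") {
  train_data %>%
    pivot_longer(
      cols = where(is.numeric),
      names_to = "predictor",
      values_to = "value"
    ) %>%
    ggplot(aes(x = value, fill = .data[[class_col]])) +
    geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
    facet_wrap(vars(predictor), scales = "free") +
    labs(x = "Predictor value", y = "Number of observations", fill = "Class")
}

# Usage, assuming `pulsar_train` from the cell above:
# summarize_by_class(pulsar_train)
# plot_predictor_distributions(pulsar_train)
```

The panels use free scales because the predictor means in the table above span very different ranges (roughly 0.48 to 104.8). Since class 0 outnumbers class 1 by about ten to one in the training set, the class 1 bars will look small on a count axis; mapping `y = after_stat(density)` in `aes()` is one possible way to compare shapes instead of counts.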


**Methods**

Explain how you will conduct your data analysis and which variables/columns you will use. Note: you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable, think: is this a useful variable for prediction?

Describe at least one way that you will visualize the results.

**Expected outcomes and significance**

What do you expect to find?

What impact could such findings have?

What future questions could this lead to?