# Final Group Report
#### *Group 11: Calvin Choi, Ruby Liu, Muhan Yang, Leon Zhang*

## Introduction

GitHub is one of the largest collections of open source software in the world (Borges et al., 2016). One feature of repositories hosted on GitHub is that users can *star* them, which developers mainly do to show appreciation, to express interest in or satisfaction with a project, or to bookmark a repository for future use (Begel et al., 2013). Past research has shown that factors such as programming language, number of forks, number of commits, and application domain may affect the number of stars a GitHub repository obtains (Borges et al., 2016).

Our project further investigated statistical models of the factors that might relate to the number of stars in popular GitHub repositories. The dataset we use is the Most Popular Github Repositories (Projects) dataset from Kaggle (URL: https://www.kaggle.com/datasets/donbarbos/github-repos/data), which contains a list of top GitHub project repositories by number of stars. The data were collected using the GitHub search API and its query function to obtain projects with over 167 stars.

The repositories dataset consists of 215,029 project repositories (i.e., rows/observations) and 24 columns (i.e., variables) in total. Five of the 24 columns contain integer data: Size, Stars, Forks, Issues, and Watchers. The remaining 19 columns are stored as character data. Of these 19, nine are binary variables coded as "True" or "False": Has.Issues, Has.Projects, Has.Downloads, Has.Wiki, Has.Pages, Has.Discussions, Is.Fork, Is.Archived, and Is.Template. Two of the other ten columns, Created.At and Updated.At, contain temporal data stored as characters.
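
Because the binary and temporal columns arrive as character data, they may need recoding before modeling. The following is a minimal sketch on an invented toy data frame, not the actual dataset; the timestamp format (ISO 8601 with a trailing `Z`) is an assumption, and `toy` and `flag_cols` are hypothetical names used only for illustration.

```r
# Toy rows mimicking the layout described above (all values are invented)
toy <- data.frame(
  Stars      = c(1200L, 450L, 310L),
  Has.Wiki   = c("True", "False", "True"),
  Is.Fork    = c("False", "False", "True"),
  Created.At = c("2015-03-01T12:30:00Z", "2019-07-15T08:00:00Z",
                 "2021-11-30T23:59:59Z"),
  stringsAsFactors = FALSE
)

# Recode the "True"/"False" strings as logical values
flag_cols <- c("Has.Wiki", "Is.Fork")
toy[flag_cols] <- lapply(toy[flag_cols], function(x) x == "True")

# Parse the timestamps (the ISO 8601 format is an assumption)
toy$Created.At <- as.POSIXct(toy$Created.At,
                             format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Inspect the resulting column types
str(toy)
```

With logical and date-time columns, `lm()` treats each flag as a single two-level predictor rather than as free text.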

For the purposes of this project, we aim to build a predictive model and an inferential model using the number of stars as the outcome variable, and to identify the repository characteristics most strongly associated with the number of stars received. Here are our two questions:
1. What characteristic(s) of the GitHub repository can help to predict the number of stars/likes the repository received?
2. INFERENTIAL QUESTION ----------------

The first question concerns predicting the number of stars the top repositories received based on the data. It is exploratory and aims to find the best predictive model for Stars, using model selection and evaluation techniques.

SAY SOMETHING ABOUT THE SECOND ONE?

## Methods

In [None]:
# import R libraries
library(tidyverse)
library(GGally)
library(mltools)
library(leaps)

### Exploratory Data Analysis (EDA)

In [None]:
# github link to csv dataset
link <- "https://raw.githubusercontent.com/splashhhhhh/stat301/main/repositories.csv"

# read data
data <- read.csv(link)

# see first few lines of the dataset
# head(data)

In [None]:
# check NAs, 0s, and empty values
colSums(is.na(data))
colSums(data == 0, na.rm = TRUE)
colSums(data == "", na.rm = TRUE)

In [None]:
# Filter out 0s and empty values
data <- data %>%
    filter(Size != 0, Forks != 0, Issues != 0) %>%
    filter(Name != "")

In the Default.Branch column, most values are either main or master. Therefore, we keep only those repositories and filter out the rest.

In [None]:
# filter default branch to contain only main and master
data <- data %>%
    filter(Default.Branch == "main" | Default.Branch == "master")

Since the dataset contains a huge number of observations, we use only 3,000 observations randomly drawn from the data.

In [None]:
# random selection of 3,000 observations
set.seed(2024)  # arbitrary seed, added so the sample is reproducible
data_s <- sample_n(data, 3000, replace = FALSE)

### Plan

#### Predictive question:
We first split the 3,000 randomly selected observations into a training set and a testing set using a 70-30 ratio. We fit a linear regression model on the training set (0.7 × 3000 = 2,100 observations) and evaluate it on the testing set, which contains the remaining 30% of the sample (0.3 × 3000 = 900 observations).

Using forward stepwise selection, we can compare the candidate models of each size by their BIC (Bayesian Information Criterion) or by Mallows' Cp, and select the model that minimizes the chosen criterion, since our goal is prediction. Both criteria penalize model size and are computed from the training data alone, so they act as proxies for test error without looking at the test data.
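
To make the two criteria concrete, here is a minimal base-R sketch on the built-in `mtcars` data rather than our dataset. It does not run the forward search itself: the predictors are added in a fixed, hypothetical order, and the names `predictors`, `sigma2_full`, and `criteria` are ours. `BIC()` from the stats package is on a different scale from the `bic` values that `regsubsets()` reports, so only the comparison across model sizes is meaningful here.

```r
# Candidate predictors, added one at a time in a fixed (hypothetical) order
predictors <- c("wt", "hp", "qsec", "drat")
n <- nrow(mtcars)

# Residual variance of the largest model, used in the Cp formula
full_fit <- lm(reformulate(predictors, response = "mpg"), data = mtcars)
sigma2_full <- sum(resid(full_fit)^2) / df.residual(full_fit)

# RSS, Mallows' Cp, and BIC for each nested model
criteria <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  fit <- lm(reformulate(predictors[1:k], response = "mpg"), data = mtcars)
  rss <- sum(resid(fit)^2)
  p <- k + 1  # number of coefficients, including the intercept
  data.frame(
    n_input_variables = k,
    RSS = rss,
    Cp  = rss / sigma2_full - n + 2 * p,
    BIC = BIC(fit)
  )
}))

criteria

# Model sizes that minimize each criterion
criteria$n_input_variables[which.min(criteria$Cp)]
criteria$n_input_variables[which.min(criteria$BIC)]
```

RSS can only decrease as predictors are added, while Cp and BIC add a penalty that grows with model size, which is why their minima can fall at a smaller model.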

### Implementation

#### Prediction

In [None]:
# split the data into training and testing set
data_s$ID <- rownames(data_s)
training_data <- sample_n(data_s, size = nrow(data_s) * 0.70,
  replace = FALSE
)

testing_data <- anti_join(data_s,
  training_data,
  by = "ID"
)

training_data <- training_data %>% select(-"ID")
testing_data <- testing_data %>% select(-"ID")


In [None]:
# build an additive predictive model
data_full_OLS <- lm(Stars ~ .,
  training_data
)
# data_full_OLS

In [None]:
# obtain the (out-of-sample) predicted values
data_test_pred_full_OLS <- predict(data_full_OLS, newdata = testing_data)
# head(data_test_pred_full_OLS)

In [None]:
# compute the Root Mean Squared Error (RMSE) using data from the test set 
data_RMSE_models <- tibble(
  Model = "OLS Full Regression",
  RMSE = rmse(
    data_test_pred_full_OLS,
    testing_data$Stars
  )
)
# data_RMSE_models
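
For reference, the RMSE is the square root of the mean squared difference between predicted and observed values. The sketch below spells this out in base R on invented toy numbers; `rmse_manual` is a hypothetical helper written for illustration and is not part of `mltools`.

```r
# Hand-rolled RMSE, for illustration only
rmse_manual <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Invented toy values: errors are 10, -10, and 20
predicted <- c(110, 190, 320)
actual    <- c(100, 200, 300)

rmse_manual(predicted, actual)  # sqrt(200), about 14.14
```

Because the errors are squared before averaging, a few badly mispredicted repositories can dominate the RMSE.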

In [None]:
# select a reduced LR using the forward selection algorithm from training set
data_forward_sel <- regsubsets(
  x = Stars ~., nvmax = 9,
  data = training_data,
  method = "forward"
)
# data_forward_sel

data_fwd_summary <- summary(data_forward_sel)

data_fwd_summary <- tibble(
   n_input_variables = 1:9,
   RSS = data_fwd_summary$rss,
   BIC = data_fwd_summary$bic,
   Cp = data_fwd_summary$cp
)

In [None]:
# Identify the size of the model that minimizes Cp
cp_min <- which.min(data_fwd_summary$Cp)

# Find the name of the variables for the best model
selected_var <- names(coef(data_forward_sel, cp_min))[-1]

# Reduce dataset to only include the selected predictors
training_subset <- training_data %>% select(all_of(selected_var),Stars)

# Train the predictive model
data_red_OLS <- lm(Stars ~ .,
  data = training_subset
)

# summary(data_red_OLS)

In [None]:
# use the trained model to predict the responses of the testing set
data_test_pred_red_OLS <- predict(data_red_OLS, newdata = testing_data)

In [None]:
# compute the RMSE of predicted stars in testing set
data_RMSE_models <- rbind(
  data_RMSE_models,
  tibble(
    Model = "OLS Reduced Regression",
    RMSE = rmse(data_test_pred_red_OLS, testing_data$Stars)
    )
  )
data_RMSE_models

## Discussion

### Prediction

The results showed that the full regression model had better out-of-sample prediction performance (a lower test RMSE) than the reduced one, which suggests that using all available predictors gives slightly more accurate predictions on this split.

However, note that this is only a one-time estimate of the true test RMSE based on a single random split of the data. If we split the data in a different way or by a different ratio, we could well obtain a different result, given that the difference in RMSE between the full regression and the reduced regression is quite small.
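
One way to gauge this variability is to repeat the random split many times and compare the resulting RMSE distributions. The following is a rough sketch on the built-in `mtcars` data, not our dataset; the two formulas, the seed, the number of repetitions, and the helper `one_split` are all hypothetical choices made for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Test RMSE of a larger and a smaller model on one random 70-30 split
one_split <- function(data, train_frac = 0.70) {
  train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  train <- data[train_idx, ]
  test  <- data[-train_idx, ]
  rmse_of <- function(formula) {
    fit <- lm(formula, data = train)
    sqrt(mean((predict(fit, newdata = test) - test$mpg)^2))
  }
  c(full    = rmse_of(mpg ~ wt + hp + qsec + drat),
    reduced = rmse_of(mpg ~ wt + hp))
}

# Repeat the split 100 times; rows are splits, columns are models
rmse_by_split <- t(replicate(100, one_split(mtcars)))

summary(rmse_by_split)

# Share of splits in which the larger model has the lower test RMSE
mean(rmse_by_split[, "full"] < rmse_by_split[, "reduced"])
```

If that share is close to one half, the ranking of the two models on any single split says little about which one predicts better in general.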

Also, because we want a balance between fit and parsimony when selecting models, we would ultimately pick the reduced regression model: it has a similar RMSE to the full model but includes fewer variables/predictors.

Future studies might try different ways and ratios of splitting the data to see whether a similar result is obtained. Moreover, when interpreting predictions from the current dataset, note that the dataset is relatively large and we used only a randomly selected sample from it (3,000 observations out of 215,029). The full dataset also contains many zero or empty values, which we excluded from our analysis. In addition, the dataset focuses only on the most popular repositories on GitHub. Future studies could therefore use datasets with more diverse repositories in terms of popularity, and examine whether the results and predictions generalize to the overall GitHub repository population.

## References

A. Begel, J. Bosch and M.-A. Storey, "Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder," in IEEE Software, vol. 30, no. 1, pp. 52-66, Jan.-Feb. 2013, doi: 10.1109/MS.2013.13.

H. Borges, A. Hora and M. T. Valente, "Understanding the Factors That Impact the Popularity of GitHub Repositories," 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Raleigh, NC, USA, 2016, pp. 334-344, doi: 10.1109/ICSME.2016.31. 
