# **Final Group Report**
#### *Group 11: Calvin Choi, Ruby Liu, Muhan Yang, Leon Zhang*

In [3]:
# import R libraries
library(tidyverse)
library(mltools)
library(leaps)
library(ggplot2)
library(moderndive)
library(dplyr)
library(gridExtra)
library(reshape2)
library(caret)
library(glmnet)

── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


ERROR: Error in library(mltools): there is no package called 'mltools'
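
Because an error stops the cell, the `library()` calls listed after `mltools` (for example `leaps`) are never reached, which is consistent with the later `could not find function "regsubsets"` error. A minimal sketch of a pre-flight check is shown below. The helper name `missing_packages` is hypothetical and not part of the original notebook; it only reports what is missing and leaves installation to the user.

```r
# Hypothetical helper (not part of the original notebook): report which
# packages are not installed, without attaching any of them.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tidyverse", "mltools", "leaps", "moderndive",
              "gridExtra", "reshape2", "caret", "glmnet")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

Running this before the `library()` calls makes it clear which packages need to be installed before the rest of the notebook can run.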


# Introduction

GitHub is one of the largest collections of open-source software in the world (Borges et al., 2016). One feature of repositories hosted on GitHub is that users can ***star*** them, which fellow developers mainly do to show appreciation, to express interest in or satisfaction with a project, or to bookmark a repository for future use (Begel et al., 2013). Past research has shown that factors such as programming language, number of forks, number of commits, and application domain may affect the number of stars a GitHub repository obtains (Borges et al., 2016).

Our project investigates statistical models of the factors that might relate to the number of stars of popular GitHub repositories. The dataset we are using is the "Most Popular Github Repositories (Projects)" dataset from Kaggle (URL: https://www.kaggle.com/datasets/donbarbos/github-repos/data), which lists top GitHub project repositories by number of stars. The data were collected with the GitHub search API, using a query that returns projects with over 167 stars.

The repositories dataset consists of 215,029 project repositories (i.e., rows/observations) and 24 columns (i.e., variables) in total. Five of the 24 columns hold integer data: `Size`, `Stars`, `Forks`, `Issues`, and `Watchers`. The remaining 19 columns are read as character data. Of those 19, nine are binary variables coded as "True" or "False": `Has.Issues`, `Has.Projects`, `Has.Downloads`, `Has.Wiki`, `Has.Pages`, `Has.Discussions`, `Is.Fork`, `Is.Archived`, and `Is.Template`. Among the remaining ten character columns, two contain temporal data stored as text: `Created.At` and `Updated.At`.

For this project, we aim to build a predictive model and an inferential model with the number of stars as the outcome variable, and to identify the characteristics of a repository that are most strongly associated with the number of stars it receives. Here are our two questions:
1. What characteristic(s) of a GitHub repository can help to predict the number of stars/likes the repository received?
2. Can we predict the popularity of the repository, using `Stars` as the response variable?

The first question is mainly aimed at predicting the number of stars that the top repositories received, based on the data. It is exploratory and aims to find the best predictive model for `Stars`, drawing on what we know about model selection and evaluation.


## Data Description

The dataset we use is the repositories dataset from Kaggle. It lists the top 215k projects on GitHub by star count, all with over 167 stars. It contains characteristics of each repository, such as its size, creation date, and associated homepage, as well as fields that describe its current state, such as whether it has issues or projects enabled. The dataset offers a good mix of numerical, text, and categorical data, which can also be combined to create new metrics.
The author of this dataset went through a fairly tedious process to collect the data. They ran a query through the GitHub search API, which only allowed them to extract 1,000 observations at a time, and captured all observations by changing the minimum and maximum star criteria for each query. This raises the question of whether the collection workflow is fully trustworthy, since the repetition could expose the data to human error.


### Breakdown of columns from Kaggle description

| Variable Name | Data Type | Summary |
| --- | --- | --- |
| **`Name`** | chr (character) | The name of the GitHub repository |
| **`Description`** | chr (character) | A brief textual description that summarizes the purpose or focus of the repository (may also include emojis) |
| **`URL`** | chr (character) | The URL or web address that links to the GitHub repository, which is a unique identifier for the repository |
| **`Created.At`** | dttm (date time) | The date and time when the repository was initially created on GitHub, in ISO 8601 format |
| **`Updated.At`** | dttm (date time) | The date and time of the most recent update or modification to the repository, in ISO 8601 format |
| **`Homepage`** | chr (character) | The URL to the homepage or landing page associated with the repository, providing additional information or resources |
| **`Size`** | dbl (double) | The size of the repository in bytes, indicating the total storage space used by the repository's files and data |
| **`Stars`** | dbl (double) | The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest |
| **`Forks`** | dbl (double) | The number of times the repository has been forked by other GitHub users |
| **`Issues`** | dbl (double) | The total number of open issues (items that can be created to plan, discuss, and track work) |
| **`Watchers`** | dbl (double) | The number of GitHub users who are "watching" or monitoring the repository for updates and changes |
| **`Language`** | chr (character) | The primary programming language |
| **`License`** | chr (character) | Information about the software license using a license identifier |
| **`Topics`** | chr (character) | A list of topics or tags associated with the repository, helping users discover related projects and topics of interest |
| **`Has.Issues`** | lgl (logical) | A boolean value indicating whether or not the repository has an issue tracker enabled (if true, then the repository has an issue tracker) |
| **`Has.Projects`** | lgl (logical) | A boolean value indicating whether the repository uses GitHub Projects to manage and organize tasks and work items |
| **`Has.Downloads`** | lgl (logical) | A boolean value indicating whether the repository offers downloadable files or assets to users |
| **`Has.Wiki`** | lgl (logical) | A boolean value indicating whether the repository has an associated wiki with additional documentation and information |
| **`Has.Pages`** | lgl (logical) | A boolean value indicating whether the repository has GitHub Pages enabled, allowing the creation of a website associated with the repository |
| **`Has.Discussions`** | lgl (logical) | A boolean value indicating whether the repository has GitHub Discussions enabled, allowing community discussions and information |
| **`Is.Fork`** | lgl (logical) | A boolean value indicating whether the repository is a fork of another repository (if false, then the repository is not a fork) |
| **`Is.Archived`** | lgl (logical) | A boolean value indicating whether the repository is archived (typically read-only and no longer actively maintained) |
| **`Is.Template`** | lgl (logical) | A boolean value indicating whether the repository is set up as a template |
| **`Default.Branch`** | chr (character) | The name of the default branch |

# Methods and Results

## Exploratory Data Analysis

### Data Wrangling

In [25]:
# reading from the web into R using a github link to csv dataset
link <- "https://raw.githubusercontent.com/splashhhhhh/stat301/main/repositories.csv"

# read data
data <- read.csv(link)

In [26]:
head(data)

Unnamed: 0_level_0,Name,Description,URL,Created.At,Updated.At,Homepage,Size,Stars,Forks,Issues,⋯,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,freeCodeCamp,freeCodeCamp.org's open-source codebase and curriculum. Learn to code for free.,https://github.com/freeCodeCamp/freeCodeCamp,2014-12-24T17:49:19Z,2023-09-21T11:32:33Z,http://contribute.freecodecamp.org/,387451,374074,33599,248,⋯,True,True,True,False,True,False,False,False,False,main
2,free-programming-books,:books: Freely available programming books,https://github.com/EbookFoundation/free-programming-books,2013-10-11T06:50:37Z,2023-09-21T11:09:25Z,https://ebookfoundation.github.io/free-programming-books/,17087,298393,57194,46,⋯,True,False,True,False,True,False,False,False,False,main
3,awesome,😎 Awesome lists about all kinds of interesting topics,https://github.com/sindresorhus/awesome,2014-07-11T13:42:37Z,2023-09-21T11:18:22Z,,1441,269997,26485,61,⋯,True,False,True,False,True,False,False,False,False,main
4,996.ICU,Repo for counting stars and contributing. Press F to pay respect to glorious developers.,https://github.com/996icu/996.ICU,2019-03-26T07:31:14Z,2023-09-21T08:09:01Z,https://996.icu,187799,267901,21497,16712,⋯,False,False,True,False,False,False,False,True,False,master
5,coding-interview-university,A complete computer science study plan to become a software engineer.,https://github.com/jwasham/coding-interview-university,2016-06-06T02:34:12Z,2023-09-21T10:54:48Z,,20998,265161,69434,56,⋯,True,False,True,False,False,False,False,False,False,main
6,public-apis,A collective list of free APIs,https://github.com/public-apis/public-apis,2016-03-20T23:49:42Z,2023-09-21T11:22:06Z,http://public-apis.org,5088,256615,29254,191,⋯,True,False,True,False,False,False,False,False,False,master


In [6]:
summary(data)

     Name           Description            URL             Created.At       
 Length:215029      Length:215029      Length:215029      Length:215029     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  Updated.At          Homepage              Size               Stars       
 Length:215029      Length:215029      Min.   :        0   Min.   :   167  
 Class :character   Class :character   1st Qu.:      378   1st Qu.:   237  
 Mode  :character   Mode  :character   Median :     2389   Median :   377  
                                       Mean   :    54283   Mean   :  1115  
                                       3rd Qu.:    15282   3rd Qu.:   797  
     

We can see that we may have some outliers in our data by comparing the median and mean values with the maximum. For example, in the summary of `Size`, the mean (54,283) is far larger than the median (2,389), which points to a strongly right-skewed distribution.

In [7]:
na_counts <- colSums(is.na(data)) # check for NA values
print(na_counts)

           Name     Description             URL      Created.At      Updated.At 
              0               0               0               0               0 
       Homepage            Size           Stars           Forks          Issues 
              0               0               0               0               0 
       Watchers        Language         License          Topics      Has.Issues 
              0               0               0               0               0 
   Has.Projects   Has.Downloads        Has.Wiki       Has.Pages Has.Discussions 
              0               0               0               0               0 
        Is.Fork     Is.Archived     Is.Template  Default.Branch 
              0               0               0               0 
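
A zero NA count does not necessarily mean nothing is missing: `read.csv` keeps blank character fields as empty strings (`""`) rather than `NA`, and the `head(data)` output above already shows a blank `Homepage`. The sketch below counts both kinds of missing value per column. The toy data frame and the helper name `count_missing` are made up for illustration.

```r
# Toy data frame standing in for `data` (values are made up for illustration).
toy <- data.frame(
  Name     = c("repo-a", "repo-b", "repo-c"),
  Homepage = c("https://example.org", "", ""),
  Language = c("R", "", "Python"),
  Stars    = c(500L, 300L, 200L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count NA values and empty strings in each column.
count_missing <- function(df) {
  vapply(
    df,
    function(col) sum(is.na(col) | (is.character(col) & col == "")),
    integer(1)
  )
}

count_missing(toy)
```

An alternative is to pass `na.strings = c("", "NA")` to `read.csv`, so that blank fields are read as `NA` from the start.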


At this point, we can already identify some columns that are either outside the scope of this course's analysis or likely to add noise to our models:

- `Name`: likely noise, too difficult to encode as a bag of words
- `Description`: likely noise, too difficult to encode as a bag of words
- `URL`: likely noise, could be turned into a binary indicator (Yes/No URL)
- `Created.At`/`Updated.At`: could be encoded as dates, but that would be difficult given the scope of what we have learned in this course
- `Homepage`: likely noise, could be turned into a binary indicator (Yes/No URL)
- `Language`: could be meaningful, but we would probably want to impute missing values
- `License`: again, could be meaningful, but we would probably want to impute missing values
- `Topics`: stored as a list of topics inside a single string, which we cannot handle directly, so for now we remove it and may revisit it later
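
For reference, the date and URL encodings mentioned in the list could be sketched in base R as follows. This is only an illustration and is not used in the analysis. The timestamps and homepage values are copied from the first rows printed by `head(data)`, and the helper name `parse_iso` is hypothetical.

```r
# ISO 8601 timestamps copied from the first two rows of head(data).
created <- c("2014-12-24T17:49:19Z", "2013-10-11T06:50:37Z")
updated <- c("2023-09-21T11:32:33Z", "2023-09-21T11:09:25Z")

# Hypothetical helper: parse the timestamp strings as UTC date-times.
parse_iso <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

# Repository age in days at the time of its last update.
age_days <- as.numeric(
  difftime(parse_iso(updated), parse_iso(created), units = "days")
)
round(age_days)

# Binary indicator for whether a homepage URL is present.
homepage <- c("http://contribute.freecodecamp.org/", "")
has_homepage <- as.integer(homepage != "")
has_homepage
```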

For the `Default.Branch` column, most values are either `main` or `master`. Therefore, we keep only those rows and filter out the rest.

In [8]:
# filter default branch to contain only main and master
data <- data %>%
    filter(Default.Branch == "main" | Default.Branch == "master")

In [27]:
# Columns to drop
columns_to_drop <- c("Name", "Description", "URL", "Created.At", "Updated.At", "Homepage", "Language", "License", 'Topics')

# Create a new data frame excluding the specified columns
dropped_data <- data[, !(names(data) %in% columns_to_drop)]
head(dropped_data)

Unnamed: 0_level_0,Size,Stars,Forks,Issues,Watchers,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,387451,374074,33599,248,374074,True,True,True,False,True,False,False,False,False,main
2,17087,298393,57194,46,298393,True,False,True,False,True,False,False,False,False,main
3,1441,269997,26485,61,269997,True,False,True,False,True,False,False,False,False,main
4,187799,267901,21497,16712,267901,False,False,True,False,False,False,False,True,False,master
5,20998,265161,69434,56,265161,True,False,True,False,False,False,False,False,False,main
6,5088,256615,29254,191,256615,True,False,True,False,False,False,False,False,False,master
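
In the six rows printed above, `Watchers` is identical to `Stars`. If that holds for the whole dataset, keeping `Watchers` as a predictor would let any model reproduce the response almost trivially. The sketch below shows one way to check this and drop the column. The toy data frame copies the first three printed rows, and whether the pattern holds for every row is an assumption that should be verified on `dropped_data`.

```r
# First three rows of Stars and Watchers, as printed above.
toy <- data.frame(
  Stars    = c(374074L, 298393L, 269997L),
  Watchers = c(374074L, 298393L, 269997L)
)

# TRUE when the two columns agree on every row.
watchers_duplicates_stars <- all(toy$Stars == toy$Watchers)
watchers_duplicates_stars

# If so, one option is to drop the duplicated column before modelling.
if (watchers_duplicates_stars) {
  toy <- toy[, setdiff(names(toy), "Watchers"), drop = FALSE]
}
names(toy)
```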


In [28]:
features_to_scale <- c("Size", "Forks", "Issues", "Watchers")

dropped_data[features_to_scale] <- as.data.frame(scale(dropped_data[features_to_scale]))

In [29]:
processed_data <-  dropped_data
head(processed_data)

Unnamed: 0_level_0,Size,Stars,Forks,Issues,Watchers,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,0.4743299,374074,26.84285,1.06903622,93.41788,True,True,True,False,True,False,False,False,False,main
2,-0.05295532,298393,45.82565,0.04109144,74.46148,True,False,True,False,True,False,False,False,False,main
3,-0.07523044,269997,21.11945,0.11742398,67.34891,True,False,True,False,True,False,False,False,False,main
4,0.19008643,267901,17.10648,84.8516245,66.82391,False,False,True,False,False,False,False,True,False,master
5,-0.04738725,265161,55.67305,0.0919798,66.1376,True,False,True,False,False,False,False,False,False,main
6,-0.07003823,256615,23.34719,0.7789726,63.99702,True,False,True,False,False,False,False,False,False,master


We can calculate our sample size as follows:

$$n = \frac{Z^2 \cdot p \cdot (1-p)}{E^2}$$

Where:

- $n$ = required sample size
- $Z$ = $Z$-score corresponding to the desired confidence level (1.96 for 95% confidence level)
- $p$ = estimated proportion of the population (0.5 to account for maximum variability)
- $E$ = desired margin of error (0.05)

In [30]:
# performing the above calculation (^ is R's standard power operator)
n <- (1.96^2 * 0.5 * (1 - 0.5)) / 0.05^2
print(n)

[1] 384.16
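
Because a sample size has to be a whole number, the result is usually rounded up. A minimal sketch wrapping the formula in a function is shown below. The name `sample_size` and its argument names are hypothetical. Note also that this formula is normally stated for estimating a proportion, so using it to size a regression sample is a convention adopted here rather than a guarantee.

```r
# Sample size for a given confidence level and margin of error,
# using the formula above and rounding up to a whole observation.
# z: Z-score, p: assumed proportion, e: margin of error.
sample_size <- function(z = 1.96, p = 0.5, e = 0.05) {
  ceiling(z^2 * p * (1 - p) / e^2)
}

sample_size()          # 385
sample_size(e = 0.03)  # a tighter margin of error needs more observations
```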


In [34]:
set.seed(2024)

# running a sampling of the data using the above size
sample_data <- processed_data %>%
  sample_n(size = n, replace = FALSE)
head(sample_data)

Unnamed: 0_level_0,Size,Stars,Forks,Issues,Watchers,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1.0083022,479,-0.1481999,-0.17772848,-0.15932512,True,True,True,True,True,True,False,False,False,master
2,0.09745232,941,-0.1441773,-0.09630712,-0.04360444,True,True,True,True,False,False,False,False,False,master
3,-0.075276,211,-0.1490044,0.11233514,-0.22645313,True,True,True,True,False,True,False,False,False,master
4,-0.06397756,2166,0.1132712,0.69755123,0.26323069,True,True,True,True,True,False,False,False,False,master
5,-0.07454564,4917,0.3618698,-0.12175129,0.95229473,True,False,True,False,False,True,False,False,False,main
6,-0.07665271,309,-0.1248686,-0.18790616,-0.20190632,True,True,True,True,False,False,False,False,False,master


In [35]:
# Encode the True/False columns as 0/1 indicators
sample_data_factored <- sample_data |> 
                        mutate(Has.Issues = ifelse(Has.Issues == "True", 1, 0),
                               Has.Projects = ifelse(Has.Projects == "True", 1, 0),
                               Has.Downloads = ifelse(Has.Downloads == "True", 1, 0),
                               Has.Wiki = ifelse(Has.Wiki == "True", 1, 0),
                               Has.Pages = ifelse(Has.Pages == "True", 1, 0),
                               Has.Discussions = ifelse(Has.Discussions == "True", 1, 0),
                               Is.Fork = ifelse(Is.Fork == "True", 1, 0),
                               Is.Archived = ifelse(Is.Archived == "True", 1, 0),
                               # Default.Branch holds "main"/"master", not "True"/"False":
                               # encode main as 1 and anything else as 0
                               Default.Branch = ifelse(Default.Branch == "main", 1, 0),
                               Is.Template = ifelse(Is.Template == "True", 1, 0))
head(sample_data_factored)

## Optional train/test split, kept commented out in case we need it later
# train_index <- createDataPartition(sample_data_factored$Stars, p = 0.7, list = FALSE)

# # Split the data based on the indices
# train_data <- sample_data_factored[train_index, ]
# test_data <- sample_data_factored[-train_index, ]


# train_index <- sample(1:nrow(sample_data_factored), 0.7*nrow(sample_data_factored))
# train_data <- sample_data_factored[train_index, ]
# test_data <- sample_data_factored[-train_index, ]


# head(train_data)
# head(test_data)

Unnamed: 0_level_0,Size,Stars,Forks,Issues,Watchers,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1.0083022,479,-0.1481999,-0.17772848,-0.15932512,1,1,1,1,1,1,0,0,0,0
2,0.09745232,941,-0.1441773,-0.09630712,-0.04360444,1,1,1,1,0,0,0,0,0,0
3,-0.075276,211,-0.1490044,0.11233514,-0.22645313,1,1,1,1,0,1,0,0,0,0
4,-0.06397756,2166,0.1132712,0.69755123,0.26323069,1,1,1,1,1,0,0,0,0,0
5,-0.07454564,4917,0.3618698,-0.12175129,0.95229473,1,0,1,0,0,1,0,0,0,1
6,-0.07665271,309,-0.1248686,-0.18790616,-0.20190632,1,1,1,1,0,0,0,0,0,0


### Visualization (plotting packages required)

In [36]:
# melted_data <- melt(sample_data[, c("Size", "Forks", "Issues", "Watchers")])

# # Create a boxplot
# ggplot(melted_data, aes(x = variable, y = value)) +
#   geom_boxplot() +
#   labs(title = "Distribution of Size, Stars, Forks, Issues, and Watchers") +
#   xlab("Variable") +
#   ylab("Value")

We can drop NA values, but it is less clear whether we should also drop zeros, since a zero is still a valid value.

In [37]:
# check for 0s and empty strings in each column
colSums(data==0)
colSums(data=="")

In [39]:
# # Filter out 0s and empty values
# data <- data %>%
#     filter(Size != 0, Forks != 0, Issues != 0) %>%
#     filter(Name != "")

Since the dataset contains a huge number of observations, we are only going to use 3,000 observations randomly drawn from the entire dataset.

In [40]:
# # random selection of 3,000 observations
# data_s <- sample_n(data, 3000, replace = FALSE)

Note: the prediction cells below will not run as they stand, because the cell that creates `data_s` is commented out above.

#### Predictive question (Needs to be updated):
We planned to first split the random sample of 3,000 observations into a training set and a testing set using a 70-30 ratio, fit a linear regression model on the training set, and then evaluate that model on the testing set, which contains 30% of the sample (i.e., 0.3 × 3,000 = 900 observations).

With forward stepwise selection, we can use the BIC (Bayesian Information Criterion) of each candidate model to choose among them, since we want the model to be predictive rather than generative. We can also plot Mallows' Cp against the number of input variables and select the model with the minimum Cp. Like Cp, BIC penalizes model size, so it can serve as an approximation to the test MSE without looking at the test data.
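
As a rough illustration of forward selection with a BIC penalty, the sketch below uses base R's `step()` on the built-in `mtcars` data rather than `leaps::regsubsets()` on our repository data. The dataset, the response `mpg`, and all object names are stand-ins chosen so that the example is self-contained.

```r
# Built-in dataset standing in for the training data;
# `mpg` plays the role of `Stars`.
train <- mtcars
n <- nrow(train)

null_model <- lm(mpg ~ 1, data = train)  # intercept only
full_model <- lm(mpg ~ ., data = train)  # all available predictors

# Forward selection starting from the null model.
# k = log(n) makes step() apply the BIC penalty (the default k = 2 is AIC).
forward_bic <- step(
  null_model,
  scope = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "forward",
  k = log(n),
  trace = 0
)

formula(forward_bic)  # predictors chosen by the procedure
BIC(forward_bic)
```

At each step the procedure adds the single predictor that lowers the criterion the most and stops once no addition improves it, which is the same greedy idea as `method = "forward"` in `regsubsets()`.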

###  Implementation of a proposed model

#### Prediction

In [7]:
# split the data into training and testing set
data_s$ID <- rownames(data_s)
training_data <- sample_n(data_s, size = nrow(data_s) * 0.70,
  replace = FALSE
)

testing_data <- anti_join(data_s,
  training_data,
  by = "ID"
)

training_data <- training_data %>% select(-"ID")
testing_data <- testing_data %>% select(-"ID")


In [10]:
# build an additive predictive model
data_full_OLS <- lm(Stars ~ ., data = training_data)
# data_full_OLS

ERROR: Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
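
This error typically appears when a character or factor predictor has only one distinct value in the training data, so `lm()` cannot build contrasts for it. The sketch below finds and drops such columns before fitting. The toy data frame is made up for illustration, and which columns are actually constant in `training_data` would need to be checked.

```r
# Toy training data (made-up values): `Is.Fork` has a single level.
toy_train <- data.frame(
  Stars    = c(500, 300, 250, 200),
  Forks    = c(40, 25, 30, 10),
  Has.Wiki = c("True", "False", "True", "True"),
  Is.Fork  = c("False", "False", "False", "False"),
  stringsAsFactors = FALSE
)

# Names of columns with fewer than two distinct non-missing values.
constant_cols <- names(toy_train)[
  vapply(toy_train, function(col) length(unique(na.omit(col))) < 2, logical(1))
]
constant_cols

# Drop them so that lm() can build contrasts for the remaining predictors.
toy_clean <- toy_train[, setdiff(names(toy_train), constant_cols), drop = FALSE]
fit <- lm(Stars ~ ., data = toy_clean)
```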


In [4]:
# obtain the (out-of-sample) predicted values
data_test_pred_full_OLS <- predict(data_full_OLS, newdata = testing_data)
# head(data_test_pred_full_OLS)

ERROR: Error in eval(expr, envir, enclos): object 'data_full_OLS' not found


In [5]:
# compute the Root Mean Squared Error (RMSE) using data from the test set 
data_RMSE_models <- tibble(
  Model = "OLS Full Regression",
  RMSE = rmse(
    data_test_pred_full_OLS,
    testing_data$Stars
  )
)
# data_RMSE_models

ERROR: Error in rmse(data_test_pred_full_OLS, testing_data$Stars): could not find function "rmse"
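
The `rmse()` function was expected to come from `mltools`, which failed to load in the first cell. If installing the package is not an option, a base R stand-in is short to write. The name `rmse_base` below is hypothetical and not part of any package.

```r
# Hypothetical base R stand-in for an RMSE helper:
# root mean squared error between predictions and observed values.
rmse_base <- function(preds, actuals) {
  stopifnot(length(preds) == length(actuals))
  sqrt(mean((preds - actuals)^2))
}

# Small made-up example: the errors are 1, 0 and -2.
rmse_base(c(2, 4, 6), c(1, 4, 8))
```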


In [6]:
# select a reduced LR using the forward selection algorithm from training set
data_forward_sel <- regsubsets(
  x = Stars ~., nvmax = 9,
  data = training_data,
  method = "forward"
)
# data_forward_sel

data_fwd_summary <- summary(data_forward_sel)

data_fwd_summary <- tibble(
   n_input_variables = 1:9,
   RSS = data_fwd_summary$rss,
   BIC = data_fwd_summary$bic,
   Cp = data_fwd_summary$cp
)

ERROR: Error in regsubsets(x = Stars ~ ., nvmax = 9, data = training_data, method = "forward"): could not find function "regsubsets"


In [7]:
# Identify the size of the model that minimizes Cp
cp_min = which.min(data_fwd_summary$Cp)

# Find the name of the variables for the best model
selected_var <- names(coef(data_forward_sel, cp_min))[-1]

# Reduce dataset to only include the selected predictors
training_subset <- training_data %>% select(all_of(selected_var),Stars)

# Train the predictive model
data_red_OLS <- lm(Stars ~ .,
  data = training_subset
)

# summary(data_red_OLS)

ERROR: Error in eval(expr, envir, enclos): object 'data_fwd_summary' not found


In [8]:
# use the trained model to predict the responses of the testing set
data_test_pred_red_OLS <- predict(data_red_OLS, newdata = testing_data)

ERROR: Error in eval(expr, envir, enclos): object 'data_red_OLS' not found


In [9]:
# compute the RMSE of predicted stars in testing set
data_RMSE_models <- rbind(
  data_RMSE_models,
  tibble(
    Model = "OLS Reduced Regression",
    RMSE = rmse(data_test_pred_red_OLS, testing_data$Stars)
    )
  )
data_RMSE_models

ERROR: Error in eval(expr, envir, enclos): object 'data_RMSE_models' not found


## Discussion

### Prediction

The results showed that the full regression model had better out-of-sample prediction performance than the reduced one, which suggests that the full OLS regression model predicts better when all factors are considered.

However, note that this is only a single estimate of the true test RMSE, based on one random split of the data. If we split the data in a different way or with a different ratio, we could well obtain a different result, given that the difference in RMSE between the full regression and the reduced regression is small.
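
One way to see how much the comparison depends on the split is to repeat it many times. The sketch below does this on the built-in `mtcars` data, with `mpg` standing in for `Stars` and two arbitrary formulas standing in for the full and reduced models. None of these choices come from our analysis.

```r
set.seed(2024)

# Built-in dataset standing in for the sample; `mpg` plays the role of `Stars`.
dat <- mtcars

# Test RMSE of a larger and a smaller model on the same random 70/30 split.
compare_once <- function(data, train_frac = 0.7) {
  idx   <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  train <- data[idx, ]
  test  <- data[-idx, ]
  rmse  <- function(fit) sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  c(
    full    = rmse(lm(mpg ~ wt + hp + disp + qsec, data = train)),
    reduced = rmse(lm(mpg ~ wt + hp, data = train))
  )
}

# Repeat the split 100 times; the result is a 2 x 100 matrix.
results <- replicate(100, compare_once(dat))

rowMeans(results)                                # average test RMSE per model
mean(results["reduced", ] < results["full", ])   # share of splits won by the reduced model
```

If the two models trade places from split to split, the single-split difference reported above should not be read as evidence that one model predicts better.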

Also, because we want a balance between fit and parsimony when selecting models, we would ultimately pick the reduced regression model: it has a similar RMSE to the full model but includes fewer variables/predictors.

Future studies might try different ways and ratios of splitting the data to see whether a similar result is obtained. Moreover, when making predictions from the current dataset, note that it is relatively large and that we only used a randomly selected sample from it (3,000 observations out of 215,029). It also contains a lot of missing data, which we did not include in our analysis. In addition, the dataset only covers the most popular repositories on GitHub. Future studies could therefore use datasets that are more diverse in terms of repository popularity, and check whether the results and predictions generalize to the overall population of GitHub repositories.

## References

A. Begel, J. Bosch and M. -A. Storey, "Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder," in IEEE Software, vol. 30, no. 1, pp. 52-66, Jan.-Feb. 2013, doi: 10.1109/MS.2013.13.

H. Borges, A. Hora and M. T. Valente, "Understanding the Factors That Impact the Popularity of GitHub Repositories," 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Raleigh, NC, USA, 2016, pp. 334-344, doi: 10.1109/ICSME.2016.31. 
