# **Final Group Report**
#### *Group 11: Calvin Choi, Ruby Liu, Muhan Yang, Leon Zhang*

In [130]:
# import R libraries
library(tidyverse)
library(mltools)
library(leaps)
library(ggplot2)
library(moderndive)
library(dplyr)
library(gridExtra)
library(reshape2)
library(caret)
library(glmnet)

# Introduction

GitHub is one of the largest collection of open source software in the world (Borges et al., 2016). One feature of hosted repositories in Github is that they allow GitHub users to ***star*** them, which is mainly used for fellow developers to show appreciation, manifest interest or satisfaction with the current project, or bookmark a certain repository for future utility (Begel et al., 2013). Past research has shown that factors such as programming languages, number of forks, number of commits, application domains, and so on may affect the number of stars a GitHub repository obtains, according to Borges et al. (2016).

Our project investigates potential statistical models on various factors that might relate to numbers of stars in popular GitHub repositories, based on a dataset of information of the most popular GitHub repositories. The dataset we are using is the "Most Popular Github Repositories (Projects)" dataset from Kaggle (URL: https://www.kaggle.com/datasets/donbarbos/github-repos/data), which contains a list of top GitHub project repositories by the number of stars. The data is collected using GitHub search API and the query function to obtain projects with over 167 stars.

The repositories dataset consists of 215,029 project repositories (i.e., rows/observations) and 24 columns (i.e., variables) in total. Of the 24 columns, there are five columns containing the integer type of data of repositories: `Size`, `Stars`, `Forks`, `Issues`, and `Watchers`, and the data type of the remaining 19 variables are characters. Of those 19 columns, there are nine binary variables that contain data coded as "True" or "False" - `Has.Issues`, `Has.Projects`, `Has.Downloads`, `Has.Wiki`, `Has.Pages`, `Has.Discussions`, `Is.Fork`, `Is.Archived`, and `Is.Template`. There are also two columns, consisting of temporal data as character data type, within the rest ten variables: `Created.At` and `Updated.At`.

For our purposes of the project, we aim to build a predictive model and an inferential model using the number of stars as the outcome variable, and figure out the most significant factors/characteristics of a repository that impact its number of stars received. Here are our two questions:
1. What characteristic(s) of a GitHub repository can help to predict the number of stars/likes the repository received?
2. Can we predict the popularity of the repository? Using the response variable: `Stars`.

The first one is mostly proposed for making predictions on the number of stars the top repositories received based on the data. This question is an exploratory one and is aiming to find a best predictive model for Stars, using the knowledge from model selection and evaluation.

***!!! SAY SOMETHING ABOUT THE SECOND ONE?***

## Data Description

The dataset we were given is called the repository dataset from Kaggle. It is a dataset that lists the top 215k projects by star with over 167 stars on Github. This dataset contains data pertaining to the characteristics such as the size, creation date and homepage associated to the repo as well as its characteristics that tell us about its current state such as whether or not it has issues or projects. The dataset has a good diversity of numerical, text and categorical data which can also be combined to create metrics.
The author of this dataset went through quite a tedious process to collect this data. They used the github search api and ran a query through it which would only allow the author to extract 1000 observations at a time. They were able to capture all the observation by changing the stars criteria by changing the min and max every time. This does raise the question on whether or not the author had trustworthy workflow to acquire the data as the repetition could compromise the data to human error.

***!!! REPEATED INFO FROM THE DESCRIPTION THAT IS IN THE INTRODUCTION***

### Breakdown of columns from Kaggle description

| Variable Name | Data Type | Summary |
| --- | --- | --- |
| **`Name`** | chr (character) | The name of the GitHub repository |
| **`Description`** | chr (character) | A brief textual description that summarizes the purpose or focus of the repository (may also include emojis) |
| **`URL`** | chr (character | The URL or web address that links to the GitHub repository, which is a unique identified for the repository |
| **`Created.At`** | dttm (date time) | The date and time when the repository was initially created on GitHub, in ISO 8601 format |
| **`Updated.At`** | dttm (date time) | The date and time of the most recent update or modification to the repository, in ISO 8601 format |
| **`Homepage`** | chr (character) | The URL to the homepage or landing page associated with the repository, providing additional information or resources |
| **`Size`** | dbl (double) | The size of the repository in bytes, indicating the total storage space used by the repository's files and data |
| **`Stars`** | dbl (double) | The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest |
| **`Forks`** | dbl (double) | The number of times the repository has been forked by other GitHub users |
| **`Issues`** | dbl (double) | The total number of open issues (items that can be created to plan, discuss, and track work) |
| **`Watchers`** | dbl (double) | The number of GitHub users who are "watching" or monitoring the repository for updates and changes |
| **`Language`** | chr (character) | The primary programming language |
| **`License`** | chr (character) | Information about the software license using a license identifier |
| **`Topics`** | chr (character) | A list of topics or tags associated with the repository, helping users discover related projects and topics of interest |
| **`Has.Issues`** | lgl (logical) | A boolean value indicating whether or not the repository has an issue tracker enabled (if true, then the repository has an issue tracker) |
| **`Has.Projects`** | lgl (logical) | A boolean value indicating whether the repository uses GitHub Projects to manage and organize tasks and work items |
| **`Has.Downloads`** | lgl (logical) | A boolean value indicating whether the repository offers downloadable files or assets to users |
| **`Has.Wiki`** | lgl (logical) | A boolean value indicating whether the repository has an associated wiki with additional documentation and information |
| **`Has.Pages`** | lgl (logical) | A boolean value indicating whether the repository has GitHub Pages enabled, allowing the creation of a website associated with the repository |
| **`Has.Discussions`** | lgl (logical) | A boolean value indicating whether the repository has GitHub Discussions enabled, allowing community discussions and information |
| **`Is.Fork`** | lgl (logical) | A boolean value indicating whether the repository is a fork of another repository (if false, then the repository is not a fork) |
| **`Is.Archived`** | lgl (logical) | A boolean value indicating whether the repository is archived (typically read-only and no longer actively maintained) |
| **`Is.Template`** | lgl (logical) | A boolean value indicating whether the repository is set up as a template |
| **`Default.Branch`** | chr (character) | The name of the default branch |

# Methods and Results

## Exploratory Data Analysis

### Data Wrangling

In [199]:
# reading from the web into R using a github link to csv dataset
link <- "https://raw.githubusercontent.com/splashhhhhh/stat301/main/repositories.csv"

# read data
data <- read.csv(link)

In [200]:
head(data)

Unnamed: 0_level_0,Name,Description,URL,Created.At,Updated.At,Homepage,Size,Stars,Forks,Issues,⋯,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,freeCodeCamp,freeCodeCamp.org's open-source codebase and curriculum. Learn to code for free.,https://github.com/freeCodeCamp/freeCodeCamp,2014-12-24T17:49:19Z,2023-09-21T11:32:33Z,http://contribute.freecodecamp.org/,387451,374074,33599,248,⋯,True,True,True,False,True,False,False,False,False,main
2,free-programming-books,:books: Freely available programming books,https://github.com/EbookFoundation/free-programming-books,2013-10-11T06:50:37Z,2023-09-21T11:09:25Z,https://ebookfoundation.github.io/free-programming-books/,17087,298393,57194,46,⋯,True,False,True,False,True,False,False,False,False,main
3,awesome,😎 Awesome lists about all kinds of interesting topics,https://github.com/sindresorhus/awesome,2014-07-11T13:42:37Z,2023-09-21T11:18:22Z,,1441,269997,26485,61,⋯,True,False,True,False,True,False,False,False,False,main
4,996.ICU,Repo for counting stars and contributing. Press F to pay respect to glorious developers.,https://github.com/996icu/996.ICU,2019-03-26T07:31:14Z,2023-09-21T08:09:01Z,https://996.icu,187799,267901,21497,16712,⋯,False,False,True,False,False,False,False,True,False,master
5,coding-interview-university,A complete computer science study plan to become a software engineer.,https://github.com/jwasham/coding-interview-university,2016-06-06T02:34:12Z,2023-09-21T10:54:48Z,,20998,265161,69434,56,⋯,True,False,True,False,False,False,False,False,False,main
6,public-apis,A collective list of free APIs,https://github.com/public-apis/public-apis,2016-03-20T23:49:42Z,2023-09-21T11:22:06Z,http://public-apis.org,5088,256615,29254,191,⋯,True,False,True,False,False,False,False,False,False,master


We can see that we may have some outliers in our data by looking at the median and mean values and comparing it to the max value. For example, observe the summary of `Size`.

In [201]:
na_counts <- colSums(is.na(data)) # check for NA values
print(na_counts)

           Name     Description             URL      Created.At      Updated.At 
              0               0               0               0               0 
       Homepage            Size           Stars           Forks          Issues 
              0               0               0               0               0 
       Watchers        Language         License          Topics      Has.Issues 
              0               0               0               0               0 
   Has.Projects   Has.Downloads        Has.Wiki       Has.Pages Has.Discussions 
              0               0               0               0               0 
        Is.Fork     Is.Archived     Is.Template  Default.Branch 
              0               0               0               0 


At this point, we can already recognize some columns that will either be out of the scope of this course's analysis or likely be noise in our models:

- `Name`: likely noise, too difficult to Bag of Words encode
- `Description`: likely noise, too difficult to Bag of Words encode
- `URL`: likely noise, could turn into binary (Yes/No URL)
- `Created.At`/`Updated.At`: could encode date, however that would be difficult given the scope what we have learned in this course
- `Homepage`: likely noise, could turn into binary (Yes/No URL)
- `Language`: could be meaningful but we would probably want to impute NA values
- `License`: again, could be meaningful but we would probably want to impute NA values
- `Topics`: formatted in a way where we can't handle the data (List of topics (str)), so for now we can remove it and hopefully we can deal with it later

For the `Default.Branch` column, most of its data is either main and master. Therefore, we will filter out the rest.

In [202]:
# filter default branch to contain only main and master
data <- data %>%
    filter(Default.Branch == "main" | Default.Branch == "master")

In [203]:
# Columns to drop
columns_to_drop <- c("Name", "Description", "URL", "Created.At", "Updated.At", "Homepage", "Language", "License", 'Topics')

# Create a new data frame excluding the specified columns
dropped_data <- data[, !(names(data) %in% columns_to_drop)]
head(dropped_data)

Unnamed: 0_level_0,Size,Stars,Forks,Issues,Watchers,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,387451,374074,33599,248,374074,True,True,True,False,True,False,False,False,False,main
2,17087,298393,57194,46,298393,True,False,True,False,True,False,False,False,False,main
3,1441,269997,26485,61,269997,True,False,True,False,True,False,False,False,False,main
4,187799,267901,21497,16712,267901,False,False,True,False,False,False,False,True,False,master
5,20998,265161,69434,56,265161,True,False,True,False,False,False,False,False,False,main
6,5088,256615,29254,191,256615,True,False,True,False,False,False,False,False,False,master


In [204]:
#features_to_scale <- c("Size", "Forks", "Issues", "Watchers")

#dropped_data[features_to_scale] <- as.data.frame(scale(dropped_data[features_to_scale]))

In [205]:
processed_data <-  dropped_data
head(processed_data)

Unnamed: 0_level_0,Size,Stars,Forks,Issues,Watchers,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,387451,374074,33599,248,374074,True,True,True,False,True,False,False,False,False,main
2,17087,298393,57194,46,298393,True,False,True,False,True,False,False,False,False,main
3,1441,269997,26485,61,269997,True,False,True,False,True,False,False,False,False,main
4,187799,267901,21497,16712,267901,False,False,True,False,False,False,False,True,False,master
5,20998,265161,69434,56,265161,True,False,True,False,False,False,False,False,False,main
6,5088,256615,29254,191,256615,True,False,True,False,False,False,False,False,False,master


We can calculate our sample size as follows:

$$n = \frac{Z^2 * p * (1-p)}{E^2}$$

Where:

- $n$ = required sample size
- $Z$ = $Z$-score corresponding to the desired confidence level (1.96 for 95% confidence level)
- $p$ = estimated proportion of the population (0.5 to account for maximum variability)
- $E$ = desired margin of error (0.05)

In [206]:
# performing the above calculation
n <- (1.96**2 * 0.5 * (1-0.5))/0.05**2
print(n)

[1] 384.16


### Conditional Distribution of `Stars` ###

In [207]:
cond_dist <- data |>
    summarize(
        "Min" = min(Stars),
        "1st Quartile" = quantile(Stars, 0.25),
        "2nd Quartile" = quantile(Stars, 0.5),
        "3rd Quartile" = quantile(Stars, 0.75),
        "Max" = max(Stars)
    )

cond_dist

Min,1st Quartile,2nd Quartile,3rd Quartile,Max
<int>,<dbl>,<dbl>,<dbl>,<int>
167,236,373,784,374074


Conditional distribution of the `Stars` variable of the 384 repositories. The top repository by `Stars` count being named _freeCodeCamp_ has 374074 `Stars`. The lowest repository by `Stars` in the sample has 167 `Stars`. The median of `Stars` is 373.

In [208]:
set.seed(2024)

# running a sampling of the data using the above size
sample_data <- processed_data %>%
  sample_n(size = n, replace = FALSE)
head(sample_data)

Unnamed: 0_level_0,Size,Stars,Forks,Issues,Watchers,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,69132,447,69,2,447,True,True,True,False,False,False,False,False,False,main
2,57956,874,116,28,874,True,True,True,False,False,True,False,False,False,main
3,799,198,111,14,198,True,True,True,True,False,False,False,False,False,master
4,128,2005,126,0,2005,False,False,True,False,False,False,False,True,False,master
5,7202,4538,115,31,4538,True,True,True,False,False,True,False,False,False,master
6,31,289,29,7,289,True,True,True,True,False,False,False,False,False,master


In [209]:
names <- c('Has.Issues', 'Has.Projects', 'Has.Downloads', 'Has.Wiki', 'Has.Pages', 'Has.Discussions', 'Is.Fork', 'Is.Archived', 'Is.Template', 'Default.Branch')
sample_data_factored <- sample_data %>%
  mutate(across(names, as.factor))

head(sample_data_factored)

Unnamed: 0_level_0,Size,Stars,Forks,Issues,Watchers,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Fork,Is.Archived,Is.Template,Default.Branch
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
1,69132,447,69,2,447,True,True,True,False,False,False,False,False,False,main
2,57956,874,116,28,874,True,True,True,False,False,True,False,False,False,main
3,799,198,111,14,198,True,True,True,True,False,False,False,False,False,master
4,128,2005,126,0,2005,False,False,True,False,False,False,False,True,False,master
5,7202,4538,115,31,4538,True,True,True,False,False,True,False,False,False,master
6,31,289,29,7,289,True,True,True,True,False,False,False,False,False,master


# The prediction won't work because I commented out the data_s (above)

#### Predictive question (Needs to be updated):
We were going to first split the sample data containing 3,000 random selected samples into training set and testing set using a 70-30 ratio basis, use training set to determine a well-trained model using Linear Regression function, and test our model using the testing set, which contains 30% of 3000-observation sample data (i.e., 0.3*3000 = 900 observations) randomly selected from our dataset.

Using the forward stepwise selection, we can use the BIC (Bayesian Information Criterion) of each model to select the model, since we want the model to be predictive rather than generative. We can also plot the Cp plot of the model out and select the minimum Cp model. Also, BIC can be used to approximate the test MSE, without looking at the test data.

###  Implementation of a proposed model

In [210]:
set.seed(2024)

sample_data_factored$ID <- rownames(sample_data_factored)

training_dat <- sample_n(sample_data_factored, size = nrow(sample_data_factored) * 0.7, replace = F)

testing_dat <- anti_join(sample_data_factored, training_dat, by = "ID")

training_dat <- training_dat |> select(-"ID", -"Is.Fork")
testing_dat <- testing_dat |> select(-"ID", -"Is.Fork")

In [215]:
for_sel <- regsubsets(
    x = Stars ~ ., nvmax = 13,
    data = training_dat,
    method = "forward"
)

fwd_summary <- summary(for_sel)

fwd_summary_df <- tibble(
    n_input_variables = 1:13,
    RSQ = fwd_summary$rsq,
    RSS = fwd_summary$rss,
    ADJ.R2 = fwd_summary$adjr2,
    CP = fwd_summary$cp
)

In [216]:
fwd_summary_df

fwd_summary

n_input_variables,RSQ,RSS,ADJ.R2,CP
<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,1.017963e-22,1,914.339692
2,1,7.486775e-23,1,604.629238
3,1,2.4278480000000003e-23,1,21.034746
4,1,2.2898290000000003e-23,1,7.058459
5,1,2.256645e-23,1,5.217208
6,1,2.2339270000000003e-23,1,4.587487
7,1,2.213241e-23,1,4.193006
8,1,2.201449e-23,1,4.828057
9,1,2.195891e-23,1,6.1846
10,1,2.1950000000000002e-23,1,8.081525


Subset selection object
Call: regsubsets.formula(x = Stars ~ ., nvmax = 13, data = training_dat, 
    method = "forward")
13 Variables  (and intercept)
                     Forced in Forced out
Size                     FALSE      FALSE
Forks                    FALSE      FALSE
Issues                   FALSE      FALSE
Watchers                 FALSE      FALSE
Has.IssuesTrue           FALSE      FALSE
Has.ProjectsTrue         FALSE      FALSE
Has.DownloadsTrue        FALSE      FALSE
Has.WikiTrue             FALSE      FALSE
Has.PagesTrue            FALSE      FALSE
Has.DiscussionsTrue      FALSE      FALSE
Is.ArchivedTrue          FALSE      FALSE
Is.TemplateTrue          FALSE      FALSE
Default.Branchmaster     FALSE      FALSE
1 subsets of each size up to 13
Selection Algorithm: forward
          Size Forks Issues Watchers Has.IssuesTrue Has.ProjectsTrue
1  ( 1 )  " "  " "   " "    "*"      " "            " "             
2  ( 1 )  " "  " "   "*"    "*"      " "            " "      

In [224]:
# Identify the size of the model that minimizes Cp
cp_min = which.min(fwd_summary_df$CP)

# Find the name of the variables for the best model
selected_var <- names(coef(for_sel, cp_min))[-1]
selected_var <- c('Size','Forks', 'Issues','Watchers','Has.Projects','Has.Pages','Has.Discussions')

# Reduce dataset to only include the selected predictors
training_subset <- training_dat %>% select(all_of(selected_var),Stars)

# Train the predictive model
data_red_OLS <- lm(Stars ~ .,
  data = training_subset
)

# summary(data_red_OLS)

In [226]:
# use the trained model to predict the responses of the testing set
data_test_pred_red_OLS <- predict(data_red_OLS, newdata = testing_dat)

In [233]:
# build an additive predictive model
data_full_OLS <- lm(Stars ~ ., data = training_data)
# data_full_OLS

In [234]:
# obtain the (out-of-sample) predicted values
data_test_pred_full_OLS <- predict(data_full_OLS, newdata = testing_data)
# head(data_test_pred_full_OLS)

In [235]:
data_RMSE_models <- tibble(
  Model = "OLS Full Regression",
  RMSE = rmse(
    data_test_pred_full_OLS,
    testing_dat$Stars
  )
)
# data_RMSE_models

In [236]:
# compute the RMSE of predicted stars in testing set
data_RMSE_models <- rbind(
  data_RMSE_models,
  tibble(
    Model = "OLS Reduced Regression",
    RMSE = rmse(data_test_pred_red_OLS, testing_dat$Stars)
    )
  )
data_RMSE_models

Model,RMSE
<chr>,<dbl>
OLS Full Regression,1116.237
OLS Reduced Regression,2.548134e-13


## Discussion

### Prediction

The results showed that the full regression model had a better out-of-sample prediction performance compared to our reduced ones, which indicates that the full OLS regression model is better at making predictions when considering all factors.

However, note that this is only a one-time estimate of the true test RMSE based on a random split of the data. If we split the data in a different way or by a different ratio, we might be very likely to obtain a different result, given that the RMSE value difference between the full regression and the reduced regression is quite trivial. 

Also, since we tend to use simpler statistical model since we would like to have a balance between fit and parsimony when selecting models, we finally would pick the reduced regression model, since it has a similar RMSE compared to the full model, but includes less variables/predictors.

Future study might want to include different ways and ratios of splitting the data and see if a similar result can be obtained. Moreover, while making predictions using the current dataset, note that the dataset is relatively large and we only included a random selected sample from it (3000 observations out of 215,029) and it contains a lot of missing data, although we did not include them in our analysis. In addition, the dataset only focused on the most popular repositories on GitHub. Therefore, future studies can use other datasets with more diverse data in terms of the popularity of repositories, and see if the result and prediction can be generalized to the overall GitHub repository population at large.

## References

A. Begel, J. Bosch and M. -A. Storey, "Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder," in IEEE Software, vol. 30, no. 1, pp. 52-66, Jan.-Feb. 2013, doi: 10.1109/MS.2013.13.

H. Borges, A. Hora and M. T. Valente, "Understanding the Factors That Impact the Popularity of GitHub Repositories," 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Raleigh, NC, USA, 2016, pp. 334-344, doi: 10.1109/ICSME.2016.31. 
