feat(#307): information about samples filtering in tex/report.tex #309

Merged 4 commits on May 10, 2024
12 changes: 8 additions & 4 deletions tex/report.tex
@@ -86,7 +86,7 @@ \section{Motivation}\label{sec:motivation}
their research results, paper authors must somehow guarantee that the source
code used at the time of research remains available and intact throughout the
paper's lifetime. One obvious solution would be to make copies of the
repositories being extracted and then host them somewhere they are "forever"
repositories being extracted and then host them somewhere they are ``forever''
available.

Second, research methods typically involve filtering out certain types of files
@@ -134,8 +134,12 @@ \section{Methodology}\label{sec:method}
Python, Ruby, and Bash, which do exactly the following:
\begin{itemize}
\item Fetch open repositories from GitHub, which have \ff{java} language
tag, have reasonably big but not too big number of stars, and are
of certain minimum size;
tag, have a reasonably large but not too large number of stars, and are of a certain minimum size;
\item Filter out repositories that have a license other than MIT or Apache License;
\item Filter out repositories that contain samples instead of a real project,
Owner
@h1alexbel this description is too vague, I believe. Maybe we can give a link to your Python repo here?

Contributor Author
@yegor256 fixed

framework, or library, using \ff{samples-filter}\footnote{\url{https://github.com/h1alexbel/samples-filter}},
which uses text classification to predict the class (real or sample) each
repository belongs to;
\item Remove files without \ff{.java} extension, Java files with syntax errors,
supplementary files such as \ff{package-info.java} and \ff{module-info.java},
files with very long lines, and unit tests;
@@ -151,7 +155,6 @@ \section{Methodology}\label{sec:method}

We believe that our method is ethical, as it utilizes data from publicly
available sources, thereby avoiding any infringement of copyright.
% Would be great to include only repositories with MIT and Apache license, see https://github.com/yegor256/cam/issues/275

\section{Results}\label{sec:results}

@@ -160,6 +163,7 @@ \section{Results}\label{sec:results}
\iexec{cat "${TARGET}/temp/repo-details.tex"}
The full list of them is in the \ff{repositories.csv} file.
The \ff{hashes.csv} file has a list of Git hashes of their latest commits.
Predictions of whether each repository is a sample are located in the \ff{predictions.csv} file.

The filtering process was the following:
