
Software Analytics and Machine Learning for Software Engineering (IN4334)

Software repositories archive valuable software engineering data, such as source code, execution traces, historical code changes, mailing lists, and bug reports. This data contains a wealth of information about a project's status and history. By doing data science on software repositories, researchers can gain an empirically based understanding of software development practices, and practitioners can better manage, maintain, and evolve complex software projects.
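To give a concrete flavour of what "doing data science on software repositories" can look like, here is a minimal sketch that walks a project's commit history with PyDriller (an illustration only: it assumes PyDriller is installed via pip, and the repository URL is just an arbitrary example):

```python
# Minimal sketch: traverse a repository's commit history with PyDriller.
# Assumptions: pydriller is installed; the URL below is only an example project.
from pydriller import Repository

for commit in Repository("https://github.com/apache/commons-lang").traverse_commits():
    # Each commit exposes metadata such as its hash, author, and message,
    # which can feed metrics, models, or empirical studies.
    first_line = commit.msg.splitlines()[0] if commit.msg else ""
    print(commit.hash[:8], commit.author.name, first_line)
```

From raw histories like this, one can derive change metrics, defect data, or training sets for the machine learning techniques discussed in this course.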

The goal of this seminar course is to give students an in-depth perspective on software analytics and machine learning for software engineering methods.

Structure

In a nutshell:

  • The course is composed of lectures, guest lectures, and paper discussions.
  • Each team of students (2 people) is responsible for one paper. The team presents the paper to their colleagues and answers their questions. At the end of the course, the team also delivers a clear summary of the paper. Read more about how to select the paper, how to prepare the presentation, and how to write the summary later on this page.
  • Each team proposes some new research on a topic in software analytics or machine learning for software engineering and writes a short paper containing the plan.
  • At the end of the course, teams present their research ideas.

Schedule

September 2nd: What's software analytics? What's machine learning for software engineering?

In this lecture, I will discuss what this course is all about. I will also explain the overall procedure, assessment, and projects.

As a preparation for this lecture, please read the following paper: A. E. Hassan and T. Xie, “Software intelligence: The future of mining software engineering data,” in Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, 2010, pp. 161–166.

September 7th: Machine Learning for Software Engineering

In this lecture, I will give you an introduction to machine learning for software engineering.

As a preparation for this lecture, please:

  • Watch the video by Georgios here: https://gousios.org/courses/ml4se/
  • Read the following paper: Pradel, M., & Chandra, S. (2020). Neural software analysis. arXiv preprint arXiv:2011.07986.

September 9th: Diomidis Spinellis on software analytics

Diomidis Spinellis is Professor of Software Engineering in the Department of Management Science and Technology at the Athens University of Economics and Business, Professor of Software Analytics in the Department of Software Technology at Delft University of Technology, and director of the Business Analytics Laboratory (BALab).

Diomidis is one of the most influential researchers in the field of empirical software engineering and software analytics. He will talk about his views on the topic.

No preparation needed

September 14th: Paper discussions (session 1)

  • Team 5: M. Behroozi, C. Parnin and T. Barik, "Hiring is Broken: What Do Developers Say About Technical Interviews?," 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 2019, pp. 1-9, doi: 10.1109/VLHCC.2019.8818836.
  • Team 3: Kui Liu, Dongsun Kim, Tegawendé F. Bissyandé, Shin Yoo, and Yves Le Traon: "Mining Fix Patterns for FindBugs Violations". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2884955.

September 16th: Greg Wilson on How to Run a Meeting

Greg Wilson holds a Ph.D. in Computer Science. He is a co-founder of Software Carpentry and The Architecture of Open Source Applications, a co-winner of the ACM SIGSOFT Influential Educator of the Year award, a winner of the Jolt Award for Best General Book, a member of the Python Software Foundation, and the author or editor of over a dozen books on programming and two for children. Greg is also the editor of Never Work in Theory.

Greg will talk about how to run a meeting, a fundamental skill in real-world software engineering.

No preparation needed

September 21st: Paper discussions (session 2)

  • Team 11: Pankaj Jalote and Damodaram Kamma: "Studying Task Processes for Improving Programmer Productivity". IEEE Transactions on Software Engineering, 47(4), 2021, 10.1109/tse.2019.2904230.
  • Team 12: M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” in International conference on learning representations, 2018.
  • Team 8: B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu, “On the ‘Naturalness’ of Buggy Code”

September 23rd: Paper discussions (session 3)

  • Team 2: Allamanis, M., Barr, E. T., Ducousso, S., & Gao, Z. (2020, June). Typilus: Neural type hints. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (pp. 91-105).
  • Team 9: F. Kortum, J. Klünder and K. Schneider, "Behavior-Driven Dynamics in Agile Development: The Effect of Fast Feedback on Teams," 2019 IEEE/ACM International Conference on Software and System Processes
  • Team 15: Svyatkovskiy, A., Zhao, Y., Fu, S., & Sundaresan, N. (2019, July). Pythia: AI-assisted code completion system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2727-2735).
  • Team 16: S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, “Summarizing source code using a neural attention model,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 2073–2083.

September 28th: Georgios Gousios on ML at Facebook

At Facebook, around 50k software engineers and data scientists work on a single monorepo. The BigCode team is capitalizing on the availability of rich historical data to automate software engineering tasks at scale. In our talk, we will first present a few tools (and their design) that have been developed by the BigCode team and then dive into TypeWriter, an ML tool that predicts types for Python code by learning them from the existing codebase.

Read this paper before the lecture: https://arxiv.org/pdf/1912.03768.pdf

(This talk will NOT be recorded)

September 30th: Paper discussions (session 4)

  • Team 1: Clement, Colin B., et al. "PyMT5: multi-mode translation of natural language and Python code with transformers." arXiv preprint arXiv:2010.03150 (2020).
  • Team 4: Jasmine Latendresse, Rabe Abdalkareem, Diego Elias Costa, and Emad Shihab: "How Effective is Continuous Integration in Indicating Single-Statement Bugs?". 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 10.1109/msr52588.2021.00062.
  • Team 7: Brittany Johnson, Thomas Zimmermann, and Christian Bird: "The Effect of Work Environments on Productivity and Satisfaction of Software Engineers". IEEE Transactions on Software Engineering, 47(4), 2021, 10.1109/tse.2019.2903053.
  • Team 14: Afnan A. Al-Subaihin, Federica Sarro, Sue Black, Licia Capra, and Mark Harman: "App Store Effects on Software Engineering Practices". IEEE Transactions on Software Engineering, 47(2), 2021, 10.1109/tse.2019.2891715.

October 5th: Guest lecture: Alexander Serebrenik on Gender biases in SE

Alexander Serebrenik is a full professor of social software engineering at TU Eindhoven. His research domain is empirical software engineering, including both its social and technical aspects. He is particularly interested in understanding and supporting the diversity of software development teams, their communication, and their collaboration. Alexander likes mining software repositories, conducting surveys and interviews, and measuring software artefacts and processes.

Alexander will talk about his work on gender biases and how they affect the software engineering world.

No preparation needed

October 7th: Chandra Maddila, Microsoft Research

There has been a fundamental shift amongst software developers and engineers in the past few years. The software development life cycle (SDLC) for a developer has increased in complexity and scale. Changes that were developed and deployed over a matter of days or weeks are now deployed in a matter of hours. Due to the greater availability of compute and storage, better tooling, and the necessity to react, developers are constantly looking to increase their velocity and throughput of developing and deploying changes. Consequently, there is a great need for more intelligent and context-sensitive DevOps tools and services that help developers increase their efficiency while developing and debugging. Given the vast amounts of heterogeneous data available from the SDLC, such intelligent tools and services can now be built and deployed at a large scale to help developers achieve their goals and be more productive. In this talk, I will cover the important components of doing machine learning for software engineering (ML4SE) at scale. I will discuss some of the important aspects of operationalizing machine learning models and recommendations and measuring their impact. To that end, I will discuss a case study of a service named Nudge, which is designed to accelerate pull request completion time using machine learning.

Chandra Maddila works for the applied sciences group at Microsoft Research in Redmond (USA). His current research interests lie at the intersection of software engineering, developer productivity, and applications of machine learning and artificial intelligence to software engineering and developer productivity problems. Chandra is a co-founder of Project Sankie, a machine learning for software engineering (ML4SE) platform. Sankie is used extensively by thousands of developers and hundreds of product organizations at Microsoft. Chandra developed services and tools such as Nudge, ORCA, and ConE that are operationalized in thousands of repositories at Microsoft. A Microsoft Research podcast summarizing his research can be found here. Chandra’s work on bug localization received the Jay Lepreau Best Paper Award at USENIX OSDI 2018. He also delivered an invited talk at USENIX ATC. Chandra has published his research in conferences and journals such as ICSE, ESEC/FSE, TOSEM, OSDI, NSDI, CIKM, SIGIR, and ACL.

No preparation needed

October 12th: Paper discussions (session 5)

  • Team 6: Rak-amnouykit, I., McCrevan, D., Milanova, A., Hirzel, M., & Dolby, J. (2020). Python 3 types in the wild: a tale of two type systems. Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages. doi:10.1145/3426422.3426981 
  • Team 10: van der Laan, N. (2021). "Deep Just-in-Time Defect Prediction at Adyen". MSc thesis, Delft University of Technology, TU Delft Education repository.
  • Team 13: M. Pradel, V. Murali, R. Qian, M. Machalica, E. Meijer, and S. Chandra, “Scaffle: Bug localization on millions of files,” in Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis, 2020, pp. 225–236.

October 14th: No lecture

No lecture. Use this slot to work on your research project.

October 19th: Elvan Kula on software effort estimation

Elvan Kula is a doctoral candidate at TU Delft and chapter lead at ING. She focuses on using automated techniques to both understand and improve software development processes in terms of efficiency and predictability. She is the lab manager of AI for Fintech Research, a five-year research collaboration between ING and TU Delft.

Elvan will talk about her recent work on predicting delays in software deliveries at ING.

No preparation needed

October 29th: Presentation day and project deadline

Each team will present their research project:

  • 10-15 minute presentation
  • 10 minutes of questions

This session will be held on campus, in lecture hall Chip!

Deliverables

Teams have two deliverables throughout the course:

  • A summary of a software analytics or machine learning for software engineering paper
  • The research project

Paper summary

Each team is responsible for one paper summary. This task is composed of:

  • One 10+5 minute presentation (or 10+10, depending on the number of papers in that session) summarizing the paper to the other students. This presentation will be given in one of the slots in our schedule. The presentation should be composed of:
    • Motivation of work: what problem does it solve? why is this an important problem?
    • Approach: how does the paper do it?
    • Results: what are the findings of the paper?
    • Implications: how do the findings change the way we build software?
  • A summary of the paper in a blog post format.
    • See summary format for more details on how this should be written
    • We have partnered up with the Never Work in Theory blog, an initiative of Greg Wilson. The blog, as described on its own page, contains "short summaries of recent results in empirical software engineering research". At the end of the course, your article might be published there (optional)! Do your best!

You are free to select a paper. Go for the topic that you are most interested in! Some interesting lists of papers:

Note that two teams cannot work on the same paper. The first team that picks a paper has priority.

The deadline for the presentation is the day of your presentation slot. I will assign teams to slots randomly. The deadline for the summary is the last day of the course. Feel free to ask for feedback before the deadline.

Select your paper and pick your summary presentation day in our Google spreadsheet. First come, first served! See the link in our Mattermost.

Research project

You will be reading and watching many paper presentations. Now it is your turn to propose some interesting software analytics or machine learning for software engineering research of your own.

You can go for:

  • A completely new research idea that has nothing to do with the papers we discussed. Maybe you have some experience in SE and want to better understand some phenomenon.
  • An extension of a paper we discussed. Maybe you saw a paper and thought: "hmm, I'd research it differently; I'd go for a controlled experiment rather than a mining study". Feel free to propose extensions of papers.

You should write a registered report that explains your research idea. See the instructions the MSR conference gives to its authors. This should be written in Markdown. Note the examples at the bottom of the page there. Also see an example of the Markdown file you should deliver.

Note that you are only going to write about the idea and its methodology; you are not going to execute the research itself here. Nevertheless, if you really like your proposal and want to carry it out, you are free to use this paper as a plan for your MSc thesis research!

You should also prepare a presentation of your research proposal, which you will give on the last day of the course.

The deadline of the research project is the day of the final presentation. Feel free to ask for feedback before that!

Assessment

  • 50% paper summary
  • 50% research report

We do not offer resits.

Course staff

Maurício Aniche
