# Task 1: Instructions

Print out the raw data in the Git log file to show the available format.

- Print out the content of the sample file `git_log_excerpt.csv`.
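
A minimal sketch of reading and printing a file as raw text. To stay runnable anywhere, it first writes a small stand-in file to a temporary directory; the rows and their `timestamp#author` layout are assumptions for illustration, not the real content of `git_log_excerpt.csv`.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in rows; the "timestamp#author" layout is an assumption.
sample_rows = "1502382966#Linus Torvalds\n1501368308#Jane Doe\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the stand-in file so the sketch does not depend on the project data.
    sample_path = Path(tmp_dir) / "git_log_excerpt.csv"
    sample_path.write_text(sample_rows, encoding="utf-8")

    # Read the file as plain text, without parsing it.
    with open(sample_path, encoding="utf-8") as log_file:
        raw_content = log_file.read()

# Printing the unparsed text shows the separator and column order.
print(raw_content)
```

In the project itself, only the `open(...)`, `read()`, and `print(...)` steps are needed, pointed at the real sample file.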

## Good to know

This project requires that you know your way around Python and Pandas. We recommend that you have completed these DataCamp courses before doing this project:

- [Intermediate Python for Data Science](https://www.datacamp.com/courses/intermediate-python-for-data-science).
- [Data Manipulation with pandas](https://www.datacamp.com/courses/data-manipulation-with-pandas).
- [Manipulating Time Series Data in Python](https://www.datacamp.com/courses/manipulating-time-series-data-in-python).

# Task 2: Instructions

Read in the Linux Git log file with Pandas.

- Load in the `pandas` module as `pd`.
- Read in the log file `git_log.gz`. Name the 1st column "`timestamp`" and the 2nd column "`author`".
- Assign the resulting DataFrame to `git_log`.
- Print out the first five rows of `git_log`.
- The `pandas` method `read_csv` can read a CSV file compressed in a `gz` file. You will have to specify the `sep`, `encoding`, `header`, and `names` arguments to `read_csv`.
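
The steps above could look roughly like the following sketch. It builds a tiny gzip-compressed stand-in file first so that it runs without the project data. The `#` separator, the `latin-1` encoding, and the sample rows are assumptions; check the raw file from Task 1 for the actual values.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for git_log.gz with three made-up rows.
    log_path = Path(tmp_dir) / "git_log.gz"
    with gzip.open(log_path, "wt", encoding="latin-1") as gz_file:
        gz_file.write(
            "1502382966#Linus Torvalds\n"
            "1501368308#Jane Doe\n"
            "1501625560#John Roe\n"
        )

    # read_csv infers gzip compression from the ".gz" file extension.
    git_log = pd.read_csv(
        log_path,
        sep="#",               # assumed column separator
        encoding="latin-1",    # assumed file encoding
        header=None,           # the file has no header row
        names=["timestamp", "author"],
    )

# head() returns the first five rows by default.
print(git_log.head())
```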

# Task 3: Instructions

Gather some basic metrics about Linux's Git repository.

- Count the number of commits in `git_log`.
- Count the number of all contributing authors. Leave out the entries that don't have an author at all.

Here, use basic Python functions together with the methods of Pandas' `DataFrame` and `Series` to count values and to remove missing data.
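
One possible way to compute both metrics, shown on a small made-up `git_log` (the rows are hypothetical and include one entry without an author):

```python
import pandas as pd

# Hypothetical sample data; the last commit has no author.
git_log = pd.DataFrame({
    "timestamp": [1502382966, 1501368308, 1501625560, 1234567890],
    "author": ["Linus Torvalds", "Jane Doe", "Linus Torvalds", None],
})

# Every row is one commit, so the row count is the commit count.
number_of_commits = len(git_log)

# Drop missing authors first, then count the distinct names.
number_of_authors = git_log["author"].dropna().nunique()

print(f"{number_of_authors} authors committed {number_of_commits} code changes.")
```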

# Task 4: Instructions

List the ten authors that made the most commits.

- Count how often each author occurs in `git_log`, pick out the top ten authors, and assign the result to `top_10_authors`.

In this task, the result that is stored in `top_10_authors` has to be a `Series` or a `DataFrame` that includes the authors and the number of commits that each author has made.
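
A short sketch of one approach, using `value_counts()` and `head()`. The twelve authors and their commit counts are generated purely for illustration.

```python
import pandas as pd

# Hypothetical data: author_01 has 1 commit, author_02 has 2, ..., author_12 has 12.
authors = [f"author_{i:02d}" for i in range(1, 13) for _ in range(i)]
git_log = pd.DataFrame({"timestamp": range(len(authors)), "author": authors})

# value_counts() sorts by frequency in descending order,
# so the first ten entries are the ten most active authors.
top_10_authors = git_log["author"].value_counts().head(10)

print(top_10_authors)
```

The result is a `Series` indexed by author name with the commit count as its values.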

# Task 5: Instructions

Transform the numbers in the `timestamp` column into a time series-based data type.

- Convert the `timestamp` column to the Pandas `Timestamp` type.
- Look at a summary of the converted `timestamp` column to check if the conversion was successful and if the boundary values make sense.

Here is the [official Pandas documentation for how to convert this type of timestamp](http://pandas-docs.github.io/pandas-docs-travis/timeseries.html#epoch-timestamps) (called an _epoch_ timestamp) to `Timestamp`. Be sure to pass the right `unit` of time (in our case: seconds) to the date conversion method. To summarize the resulting `Timestamp` column, you could use the `describe()` method.
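
As a minimal sketch, the conversion can be done with `pd.to_datetime` and `unit="s"`. The three rows below are made up; the value `0` is included on purpose to show how a suspicious boundary value surfaces in the summary.

```python
import pandas as pd

# Hypothetical sample data with epoch timestamps in seconds.
git_log = pd.DataFrame({
    "timestamp": [0, 1500000000, 1502382966],
    "author": ["Jane Doe", "John Roe", "Linus Torvalds"],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00.
git_log["timestamp"] = pd.to_datetime(git_log["timestamp"], unit="s")

# The summary exposes the earliest and latest values for a sanity check.
print(git_log["timestamp"].describe())
```

A minimum in 1970 would be a hint that some commits carry a wrong timestamp.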

# Task 6: Instructions

Determine a reasonable time period and keep only those commits within this time period.

- Pick a reasonable _first_ timestamp and assign it to `first_commit_timestamp`.
- Pick a reasonable _last_ timestamp for this dataset from late 2017 and assign it to `last_commit_timestamp`.
- Create a new `DataFrame` called `corrected_log` that contains only the commits between these two timestamps.
- Use `describe()` on `corrected_log['timestamp']` to check the data.

A possible valid time period:

- The _first_ reasonable entry is the first commit from Linus Torvalds.
- Any timestamp before the year 2018 would be a reasonable _last_ timestamp.
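
The filtering could be sketched as follows. The sample rows are hypothetical and contain one timestamp that is too early and one that is too late. Taking the earliest commit by Linus Torvalds as the lower bound is only one way to pick it, and it assumes his own timestamps are trustworthy.

```python
import pandas as pd

# Hypothetical sample data with two implausible timestamps (1970 and 2037).
git_log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "1970-01-01 00:00:01",
        "2005-04-16 22:20:36",
        "2010-06-01 12:00:00",
        "2017-10-03 08:00:00",
        "2037-04-25 08:08:26",
    ]),
    "author": ["Jane Doe", "Linus Torvalds", "John Roe", "Linus Torvalds", "Jane Doe"],
})

# Lower bound: the earliest commit made by Linus Torvalds.
is_linus = git_log["author"] == "Linus Torvalds"
first_commit_timestamp = git_log.loc[is_linus, "timestamp"].min()

# Upper bound: the last second of 2017 (an assumed cut-off).
last_commit_timestamp = pd.Timestamp("2017-12-31 23:59:59")

# Keep only the commits inside the chosen period.
in_period = (
    (git_log["timestamp"] >= first_commit_timestamp)
    & (git_log["timestamp"] <= last_commit_timestamp)
)
corrected_log = git_log[in_period]

print(corrected_log["timestamp"].describe())
```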

# Task 7: Instructions

Count the number of commits of the `corrected_log` for each year:

- Create a new `DataFrame` called `commits_per_year` that counts the commits in each year, with every year starting on January 1st.
- Show the first five rows of the `DataFrame`.

There are many ways to accomplish this with Pandas. Use the `groupby` method with the utility function `Grouper` to group by year:

```
my_data_frame.groupby(pd.Grouper(key='my_timestamp_column',
                                 freq='AS'))
```

Here, `freq='AS'` makes `groupby` group by year using the 1st of January as starting day.
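
A runnable sketch of the grouping on made-up data follows. It uses `'YS'`, the year-start alias that recent Pandas releases prefer over `'AS'`; with the Pandas version this project was written for, `'AS'` behaves the same way. The sample commits are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: 2 commits in 2005, 1 in 2006, 3 in 2007.
corrected_log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2005-04-16", "2005-09-01",
        "2006-03-10",
        "2007-01-01", "2007-06-30", "2007-12-31",
    ]),
    "author": ["Linus Torvalds", "Jane Doe", "John Roe",
               "Jane Doe", "Linus Torvalds", "John Roe"],
})

# Group the commits into yearly bins that start on January 1st,
# then count the entries in each bin.
commits_per_year = corrected_log.groupby(
    pd.Grouper(key="timestamp", freq="YS")
).count()

print(commits_per_year.head())
```

The result is indexed by the first day of each year, and the remaining `author` column holds the number of commits in that year.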

# Task 8: Instructions

Visualize the yearly counts using a suitable plot.

- Plot `commits_per_year` using the `pandas` `plot` method.
- Add a suitable `title`.
- Turn the `legend` off.

The `plot` method in `pandas` takes many options that allow you to customize your plot. Here is the [official documentation for `plot` for Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html). That documentation contains a lot of info, but the arguments you might want to add here are `kind`, `title`, and `legend`.
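
As an illustration, a bar chart with these three arguments might be drawn as below. The yearly counts and the title text are made up, and the `Agg` backend is selected only so that the sketch also runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import pandas as pd

# Hypothetical yearly commit counts, indexed by the first day of each year.
commits_per_year = pd.DataFrame(
    {"author": [120, 340, 560, 410]},
    index=pd.to_datetime(["2005-01-01", "2006-01-01", "2007-01-01", "2008-01-01"]),
)

# kind selects the chart type, title sets the heading, legend=False hides the legend.
ax = commits_per_year.plot(
    kind="bar",
    title="Commits per year (sample data)",
    legend=False,
)
```

A bar chart is one suitable choice for yearly counts; `kind="line"` would also work.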

# Task 9: Instructions

Thanks for doing the project! As a last task:

- Set `year_with_most_commits` to the year with the most commits to Linux (as of autumn 2017).
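
Instead of reading the year off the plot, it can also be derived programmatically. The sketch below does this with `idxmax()` on made-up counts, so the printed year is not the answer for the real dataset.

```python
import pandas as pd

# Hypothetical yearly commit counts, indexed by the first day of each year.
commits_per_year = pd.DataFrame(
    {"author": [120, 340, 560, 410]},
    index=pd.to_datetime(["2005-01-01", "2006-01-01", "2007-01-01", "2008-01-01"]),
)

# idxmax() returns the index label (a Timestamp) of the largest count;
# its .year attribute gives the calendar year as an integer.
year_with_most_commits = commits_per_year["author"].idxmax().year

print(year_with_most_commits)
```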

## Further Reading

If you are interested in learning more about mining software repositories, take a look at the following books:

- Adam Tornhill: Software Design X-Rays. Pragmatic Programmers, 2018.
- Christian Bird, Tim Menzies, Thomas Zimmermann: The Art and Science of Analyzing Software Data. Morgan Kaufmann, 2015.
- Tim Menzies, Laurie Williams, Thomas Zimmermann: Perspectives on Data Science for Software Engineering. Morgan Kaufmann, 2016.