In [None]:
spark

In [2]:
display(
    div(
        font[size: 18, color: "purple"]("GitHub 💖 .NET for Apache Spark"),
        hr(),
        div(
            font[size: 4]("Welcome to the Spark.NET Notebook E2E Playground!"),
            p(),
            font[size:4]("Let's dive into the world of analyzing GitHub meta-data using .NET for Apache Spark."),
            p(),
            font[size:4]("You can see here that we're able to use"),font[size:4, color:"purple"](" HTML "),
            font[size:4]("to render beautiful cells.")
        )
    ));

## <font color=purple>Read Input Data 📈 </font>
We'll read in data about GitHub activity to understand trends and gain some interesting insights.
### We can start off by reading a CSV file containing commit information. Be sure to update the following cells with the correct data file paths.

In [3]:
DataFrame commits = spark
            .Read()
            .Option("header", "true")
            .Schema("id INT, sha STRING, author_id INT, committer_id INT, " +
                    "project_id INT, created_at TIMESTAMP")
            .Csv("<path_to_ghtorrent/commits.csv>");

commits.Show()

+---+--------------------+---------+------------+----------+-------------------+
| id|                 sha|author_id|committer_id|project_id|         created_at|
+---+--------------------+---------+------------+----------+-------------------+
|  2|397238b49c88d1d8e...|        2|           2|         1|2012-08-01 13:25:36|
|  3|55bf5367875ec9e81...|        2|           2|         1|2012-06-18 03:39:30|
|  4|9d653ea84c6df1b90...|        2|           2|         1|2012-06-11 07:47:16|
|  5|fd00ce155f1a9842d...|        2|           2|         1|2012-06-11 07:45:07|
|  6|641b94c68ecd63478...|        2|           2|         1|2012-05-07 06:00:56|
|  7|3f7d2c8f8dd589222...|        2|           2|         1|2012-03-08 04:47:19|
|  8|8adee7b7a7d634630...|        2|           2|         1|2012-03-08 04:40:43|
|  9|ecb30132a2d978a70...|        2|           2|         1|2012-03-08 04:40:25|
| 10|c1d057e040786c909...|        2|           2|         1|2012-03-08 04:24:22|
| 11|eefadec57441a172d...|  

### Next, let's find out more about the watchers.

In [4]:
DataFrame watchers = spark
            .Read()
            .Option("header", "true")
            .Schema("repo_id INT, user_id INT, created_at TIMESTAMP")
            .Csv("<path_to_ghtorrent/watchers.csv>");

watchers.Show()

+-------+-------+-------------------+
|repo_id|user_id|         created_at|
+-------+-------+-------------------+
|      1|      2|2009-12-08 10:17:27|
|      1|      4|2009-12-08 10:17:27|
|      1|      6|2010-02-05 06:35:04|
|      1|      7|2009-12-08 10:17:27|
|      1|      8|2010-03-26 04:57:36|
|      1|      9|2009-12-08 10:17:27|
|      1|     10|2009-12-08 10:17:27|
|      1|     11|2009-12-08 10:17:27|
|      1|     12|2009-12-08 10:17:27|
|      1|     13|2009-12-08 10:17:27|
|      1|     14|2010-11-24 05:43:29|
|      1|     15|2011-03-11 09:22:49|
|      1|     16|2011-06-27 06:22:45|
|      1|     17|2009-12-08 10:17:27|
|      1|     18|2010-12-09 13:14:35|
|      1|     19|2012-02-22 16:37:46|
|      1|     20|2012-03-25 02:58:07|
|      1|     21|2011-02-14 09:39:32|
|      1|     22|2012-07-24 11:45:58|
|      1| 155317|2012-12-09 12:19:08|
+-------+-------+-------------------+
only showing top 20 rows

### And last but not least, it's time to get our projects!
Let's gather all the C# projects in our data.

In [5]:
DataFrame projects = spark
            .Read()
            .Option("header", "true")
            .Schema("id INT, url STRING, owner_id INT, name STRING, " +
                    "descriptor STRING, language STRING, created_at STRING, " +
                    "forked_from INT, deleted STRING, updated_at STRING")
            .Csv("<path_to_ghtorrent/projects.csv>")
            .Filter(Col("language") == "C#");

projects.Show()

+----+--------------------+--------+--------------------+--------------------+--------+-------------------+-----------+-------+-------------------+
|  id|                 url|owner_id|                name|          descriptor|language|         created_at|forked_from|deleted|         updated_at|
+----+--------------------+--------+--------------------+--------------------+--------+-------------------+-----------+-------+-------------------+
| 582|https://api.githu...|    4398|                mono|Mono open source ...|      C#|2012-08-02 11:32:05|        581|      1|2015-10-12 15:03:36|
| 685|https://api.githu...|    5039|           opencover|A code coverage t...|      C#|2014-01-06 12:50:20|    7312471|      0|2019-05-17 13:59:43|
| 689|https://api.githu...|    5043|           opencover|A code coverage t...|      C#|2012-08-02 15:14:54|    7312471|      0|2016-10-11 14:17:00|
| 794|https://api.githu...|    5296|           opencover|A code coverage t...|      C#|2016-08-07 14:34:49|    7

## <font color=purple>Prettier Tables</font> 📃

That projects DataFrame is wide and not all that easy to read, so let's make it much nicer to look at!

### We can register a custom formatter that applies to every DataFrame.

The next time we display a DataFrame, we'll generate some prettier HTML.
We can make our data match our preferred purple .NET color scheme!

In [6]:
Microsoft.DotNet.Interactive.Rendering.Formatter<Microsoft.Spark.Sql.DataFrame>.Register((df, writer) =>
{
    var headers = new List<dynamic>();
    var columnNames = df.Columns();
    headers.Add(th(i("index")));
    headers.AddRange(columnNames.Select(c => th(c)));

    var rows = new List<List<dynamic>>();
    var currentRow = 0;
    var dfRows = df.Take(20);
    foreach (Row dfRow in dfRows)
    {
        var cells = new List<dynamic>();
        cells.Add(td(currentRow));

        foreach (string columnName in columnNames)
        {
            cells.Add(td(dfRow.Get(columnName)));
        }

        rows.Add(cells);
        ++currentRow;
    }

    var t = table[@border: "0.1"](
        thead[@style: "background-color: purple; color: white; font-family: Verdana"](headers),
        tbody[@style: "color: indigo; font-size: 14px"](rows.Select(r => tr(r))));

    writer.Write(t);
}, "text/html");

### Let's display a prettier version of our projects DataFrame.
We can display a more concise subset of the data and use our beautiful new HTML rendering.

In [7]:
DataFrame projects_cleaned = projects.Drop("id", "url", "owner_id", "created_at", "updated_at");
projects_cleaned

index,name,descriptor,language,forked_from,deleted
0,mono,"Mono open source ECMA CLI, C# and .NET implementation.",C#,581,1
1,opencover,"A code coverage tool for .NET 2 and above, support for 32 and 64 processes with both branch and sequence points; roots proudly based in PartCover",C#,7312471,0
2,opencover,"A code coverage tool for .NET 2 and above, support for 32 and 64 processes with both branch and sequence points; roots proudly based in PartCover",C#,7312471,0
3,opencover,"A code coverage tool for .NET 2 and above (WINDOWS OS only), support for 32 and 64 processes with both branch and sequence points",C#,7312471,0
4,SignalR,"Async library for .NET to help build real-time, multi-user interactive web applications.",C#,862,1
5,EndlessSpace-GalaxyBalancing,A modification to the galaxy generation so that a balanced galaxy can be generated.,C#,922,0
6,NETDeob,\N,C#,949,0
7,mongodb-csharp,A driver written in c# to connect to the MongoDB document oriented database.,C#,986,0
8,mldeploy,This is a db deploy clone for the Marklogic xml database written written in C#,C#,1217,0
9,NDllInjector,<null>,C#,2088641,1


## <font color=purple>More Advanced Queries</font> 💡
We can read in our data and display it in interesting ways. Now, let's start gaining some more advanced insights into our data.

### We can use functional programming to find the top C# projects by stars.

In [8]:
DataFrame stars = projects
        .Join(watchers, Col("id") == watchers["repo_id"])
        .GroupBy("name")
        .Agg(Count("*").Alias("stars"))
        .OrderBy(Desc("stars"));
stars

index,name,stars
0,shadowsocks-windows,38807
1,CodeHub,23765
2,corefx,18316
3,PowerShell,14740
4,Wox,14080
5,roslyn,12705
6,coreclr,12675
7,WaveFunctionCollapse,12215
8,SignalR,11105
9,ShareX,10834


### Let's analyze developer commit patterns over a week. 📆
#### For top-starred projects, do people commit more on weekdays or weekends?
Alias the project columns, join them with the commits, and aggregate to start seeing patterns in our data.

We'll take the ten most-starred projects and count their commits per day of the week, sorted by project name and day. Day 1 = Sunday and Day 7 = Saturday.

In [9]:
DataFrame projects_aliased = projects
        .As("projects_aliased")
        .Select(Col("id").As("p_id"),
                Col("name").As("p_name"),
                Col("language"),
                Col("created_at").As("p_created_at"));

DataFrame patterns = commits
        .Join(projects_aliased, commits["project_id"] == projects_aliased["p_id"])
        .Join(stars.Limit(10), Col("name") == projects_aliased["p_name"])
        .Select(DayOfWeek(Col("created_at")).Alias("commit_day"),
                Col("id").As("commit_id"),
                Col("p_name").Alias("project_name"),
                Col("stars"))
        .GroupBy(Col("project_name"), Col("commit_day"), Col("stars"))
        .Agg(Count(Col("commit_id")).Alias("commits"))
        .OrderBy(Asc("project_name"), Asc("commit_day"))
        .Select(Col("project_name"),
                Col("commit_day"),
                Col("commits"),
                Col("stars"));

patterns

index,project_name,commit_day,commits,stars
0,CodeHub,1,245,23765
1,CodeHub,2,147,23765
2,CodeHub,3,143,23765
3,CodeHub,4,166,23765
4,CodeHub,5,94,23765
5,CodeHub,6,165,23765
6,CodeHub,7,316,23765
7,PowerShell,1,810,14740
8,PowerShell,2,1975,14740
9,PowerShell,3,2561,14740
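
The `commit_day` values above follow Spark's numbering (1 = Sunday through 7 = Saturday), which is offset by one from .NET's `System.DayOfWeek` enum (0 = Sunday through 6 = Saturday). A minimal sketch of a translation helper follows; `ToDayName` is a hypothetical name introduced here for illustration and is not part of Microsoft.Spark.

```csharp
using System;

// Hypothetical helper (not part of Microsoft.Spark): converts Spark's
// 1-based day number (1 = Sunday ... 7 = Saturday) to a day name by
// shifting it onto System.DayOfWeek (0 = Sunday ... 6 = Saturday).
string ToDayName(int sparkDay)
{
    if (sparkDay < 1 || sparkDay > 7)
    {
        throw new ArgumentOutOfRangeException(nameof(sparkDay), "Expected a value from 1 to 7.");
    }

    // Fully qualified so the cast cannot be confused with Spark's DayOfWeek function.
    return ((System.DayOfWeek)(sparkDay - 1)).ToString();
}

Console.WriteLine(ToDayName(1)); // Sunday
Console.WriteLine(ToDayName(7)); // Saturday
```

A helper like this could be applied when labeling chart series or table rows, so readers do not have to remember the numbering.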


### Instead of the total number of commits on each day of the week, let's find what percentage of each project's commits falls on each day.

In [10]:
DataFrame patterns_cache = patterns.Cache();

DataFrame q = patterns_cache.GroupBy("project_name").Agg(Sum("commits").Alias("total"));

DataFrame result = patterns_cache
    .Join(q, patterns_cache["project_name"] == q["project_name"])
    .Select(patterns_cache["project_name"],
            Col("commit_day"),
            Round(Col("commits") * 100 / q["total"], 2).As("commits"),
            Col("stars"));

result

index,project_name,commit_day,commits,stars
0,Wox,2,15.06,14080
1,Wox,5,14.37,14080
2,Wox,1,16.72,14080
3,Wox,3,12.67,14080
4,Wox,7,15.35,14080
5,Wox,4,12.52,14080
6,Wox,6,13.32,14080
7,ShareX,7,11.09,10834
8,ShareX,2,15.89,10834
9,ShareX,3,13.26,10834
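
To make the percentage step concrete, here is a small worked example in plain C# (no Spark needed) that repeats the same arithmetic on the CodeHub counts from the earlier `patterns` output. It is only a sketch of the calculation, not the query itself. Midpoints are rounded away from zero to mirror the half-up behavior of Spark's `round`.

```csharp
using System;
using System.Linq;

// Daily commit counts for CodeHub, copied from the patterns output
// (index 0 = day 1 = Sunday, index 6 = day 7 = Saturday).
int[] codeHubCommits = { 245, 147, 143, 166, 94, 165, 316 };

// Equivalent of GroupBy("project_name").Agg(Sum("commits")) for one project.
int total = codeHubCommits.Sum(); // 1276

// Equivalent of Round(commits * 100 / total, 2) for each day.
double[] shares = codeHubCommits
    .Select(c => Math.Round(c * 100.0 / total, 2, MidpointRounding.AwayFromZero))
    .ToArray();

for (int day = 1; day <= 7; day++)
{
    Console.WriteLine($"day {day}: {shares[day - 1]}%");
}
```

Because every count is divided by the same project total, the seven shares for a project add up to roughly 100, apart from rounding.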


## <font color=purple>Prettier Plotting</font> 📊
Finally, now that we have some really interesting insights into our data, let's visualize it!

### Let's create a bar graph to visualize our commit patterns.

In [11]:
using XPlot.Plotly;

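// Assumes the first 21 rows of `result` hold exactly these three projects
// (7 rows each) in this order; adjust the names to match your data.
var projects = new List<string>{"CodeHub", "PowerShell", "ShareX"};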
var commitsSu = new List<double>();
var commitsMo = new List<double>();
var commitsTu = new List<double>();
var commitsWe = new List<double>();
var commitsTh = new List<double>();
var commitsFr = new List<double>();
var commitsSa = new List<double>();

foreach (Row row in result.Take(21))
{
    int dayOfWeek = row.GetAs<int>("commit_day");
    double percent = row.GetAs<double>("commits");

    // Day 1 = Sunday ... Day 7 = Saturday.
    switch (dayOfWeek)
    {
        case 1:
            commitsSu.Add(percent);
            break;
        case 2:
            commitsMo.Add(percent);
            break;
        case 3:
            commitsTu.Add(percent);
            break;
        case 4:
            commitsWe.Add(percent);
            break;
        case 5:
            commitsTh.Add(percent);
            break;
        case 6:
            commitsFr.Add(percent);
            break;
        default:
            commitsSa.Add(percent);
            break;
    }
}
var sunday = new Graph.Bar
{
    name = "Sun",
    x = projects,
    y = commitsSu
};
var monday = new Graph.Bar
{
    name = "Mon",
    x = projects,
    y = commitsMo
};
var tuesday = new Graph.Bar
{
    name = "Tue",
    x = projects,
    y = commitsTu
};
var wednesday = new Graph.Bar
{
    name = "Wed",
    x = projects,
    y = commitsWe
};
var thursday = new Graph.Bar
{
    name = "Thu",
    x = projects,
    y = commitsTh
};
var friday = new Graph.Bar
{
    name = "Fri",
    x = projects,
    y = commitsFr
};
var saturday = new Graph.Bar
{
    name = "Sat",
    x = projects,
    y = commitsSa
};

var chart = Chart.Plot(new [] {sunday, monday, tuesday, wednesday, thursday, friday, saturday});
chart.WithLayout(new Layout.Layout{barmode = "stack"});
chart.WithTitle("Developer Commit Patterns Over a Week");
chart.WithXTitle("Project");
chart.WithYTitle("% of Weekly Commits");
display(chart);
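
The plotting cell above relies on the rows of `result` arriving in the same order as the hard-coded project names. One way to avoid that dependency is to index the values by project and day before building the series. The sketch below uses plain C# with a few rows copied from the `result` output; the names `rows`, `projectNames`, `lookup`, and `seriesByDay` are illustrative assumptions rather than part of the notebook.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A few (project, day, percent) rows copied from the result output above.
var rows = new List<(string Project, int Day, double Percent)>
{
    ("Wox", 2, 15.06), ("Wox", 5, 14.37), ("Wox", 1, 16.72), ("Wox", 3, 12.67),
    ("Wox", 7, 15.35), ("Wox", 4, 12.52), ("Wox", 6, 13.32),
    ("ShareX", 7, 11.09), ("ShareX", 2, 15.89), ("ShareX", 3, 13.26),
};

// Take the project names from the data rather than hard-coding them.
List<string> projectNames = rows.Select(r => r.Project).Distinct().ToList();

// Index every value by (project, day) so row order no longer matters.
var lookup = rows.ToDictionary(r => (r.Project, r.Day), r => r.Percent);

// seriesByDay[d - 1][i] holds the percentage for projectNames[i] on day d
// (1 = Sunday ... 7 = Saturday); a missing combination becomes 0.
List<List<double>> seriesByDay = Enumerable.Range(1, 7)
    .Select(day => projectNames
        .Select(p => lookup.TryGetValue((p, day), out double v) ? v : 0.0)
        .ToList())
    .ToList();

Console.WriteLine(string.Join(", ", projectNames));
Console.WriteLine(seriesByDay[1].Count);
```

Each `seriesByDay[i]` could then serve as the `y` values of one `Graph.Bar`, with `projectNames` as the shared `x` values, so every bar lines up with the right project.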

## Bringing big data to .NET developers in the languages they love!