
Commit

lots of content
veekaybee committed Dec 29, 2021
1 parent 1502875 commit dee6518
Showing 37 changed files with 722 additions and 820 deletions.
8 changes: 5 additions & 3 deletions content/_index.md
@@ -6,11 +6,13 @@ bookToC = false

# Machine Learning: The Boring Parts

-There are lots of articles out there about the fun, sexy parts of machine learning: neural networks, AI, online learning, billions of parameters. Everyone's having a blast.
+There are lots of articles out there about the fun, sexy parts of machine learning: neural networks, AI, online learning, billions of parameters.

-This is not that site. This site is one that I always wanted to see: the internal, boring, day-to-day work that machine learning entails. It's mostly a collection of technical notes I've made over the years of things that have been useful to me personally and that I continue to learn on a day-to-day basis.
+This is not that site. This site is one that I always wanted to see: the internal, boring, day-to-day work that machine learning entails. It's a collection of technical notes I've made over the years of things that have been useful to me personally and that I continue to learn on a day-to-day basis, and in many ways is the resource I wish I'd had when I started getting into machine learning.

-If you have questions or corrections, [feel free to submit a PR.](https://github.com/veekaybee/boringml)
+The site and its layout are also constantly evolving as I map different concepts together and make changes and edits, so don't be surprised if content moves around.
+
+If you have comments or corrections, [feel free to submit a PR.](https://github.com/veekaybee/boringml)

## About Me

3 changes: 3 additions & 0 deletions content/docs/Computer Science/_index.md
@@ -0,0 +1,3 @@
---
bookCollapseSection: true
---
36 changes: 36 additions & 0 deletions content/docs/Computer Science/hash-aggregate.md
@@ -0,0 +1,36 @@
+++
title = "Hash aggregates"
type = "docs"
bookToC = false
+++


{{< figure src="https://raw.githubusercontent.com/veekaybee/veekaybee.github.io/master/static/images/checkers.png" width="300px" >}}


This post is an expansion of this tweet:

{{< tweet 1280911880157093888 >}}



## Hash Aggregate Here
But data work also has its own unique patterns, and I want to talk about one of these that I think is important for all of us to carry around in our back pockets: the humble hash aggregate. The hash aggregate [works like this](https://jakevdp.github.io/blog/2017/03/22/group-by-from-scratch/):

{{< figure src="https://raw.githubusercontent.com/veekaybee/veekaybee.github.io/master/static/images/split-apply-combine.png" width="600px">}}



You have a multidimensional array (or, as we plebes say, a table) that contains numerous instances of similar labels, and what you want is an aggregate (a count, a sum) for each distinct category. The algorithm splits the table by key, applies the aggregation (here, a sum) to each group's values, and returns a reduced table with one row per unique key and its aggregated value.
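To make the pattern concrete, here's a minimal sketch in plain Python. It's my own toy example, not code from the linked post, and it only does sums, but it has the same split-apply-combine shape:

```python
from collections import defaultdict

def hash_aggregate(rows):
    """Group (key, value) pairs by key and sum the values: a tiny GROUP BY."""
    sums = defaultdict(int)      # the hash table: key -> running aggregate
    for key, value in rows:      # split: route each row to its key's bucket
        sums[key] += value       # apply: fold the value into the aggregate
    return dict(sums)            # combine: one row per unique key

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
print(hash_aggregate(rows))      # {'a': 9, 'b': 6}
```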

It's a very simple and ingenious algorithm, and it shows up over and over and over again. If you've ever written a GROUP BY clause in SQL, you've used a hash aggregate. Python's dictionary operations utilize hash aggregates. So does Pandas' split-apply-combine (pictured above, from Jake's great post), and so does Excel's [Pivot table function](https://en.wikipedia.org/wiki/Pivot_table). So does `sort filename | uniq -c | sort -nr` in Unix. So does the map/reduce pattern popularized by Hadoop, [and implemented in-memory in Spark.](https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/) An [inverted index](https://en.wikipedia.org/wiki/Inverted_index), the foundation of Elasticsearch (and many other search and retrieval platforms), is a hash aggregate.
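As a quick illustration of the Pandas flavor (a toy DataFrame made up for this example, not data from the post), the whole split-apply-combine dance is one line:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                   "value": [1, 2, 3, 4, 5]})

# split on "key", apply a sum to each group, combine into one row per key
print(df.groupby("key")["value"].sum())
# key
# a    9
# b    6
# Name: value, dtype: int64
```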

## So what?

If you've worked with either development or data for any length of time, it's almost guaranteed that you've come across the need to get the unique categories of things and then count the things in those categories. In some cases, you might need to build your own implementation of GROUP BY because it isn't available in your language or framework of choice.

My personal opinion is that every data-centric framework that's been around long enough tends toward SQL, so everything will [eventually implement hash aggregation.](https://docs.confluent.io/5.2.0/ksql/docs/developer-guide/aggregate-streaming-data.html)

Once you understand that hash aggregation is a common pattern, it makes sense to observe it at work, learn more about how to optimize it, and generally think about it.

Once we know that this pattern has a name and exists, we have a sense of power over our data work. Confucius (or whoever the quote actually belongs to) once said, “The beginning of wisdom is to call things by their proper name,” and either he was once a curious toddler, or an MLE looking to better understand the history and context of his architecture.
50 changes: 0 additions & 50 deletions content/docs/Frameworks/AWS Lambads.md

This file was deleted.

2 changes: 2 additions & 0 deletions content/docs/Frameworks/AWS.md
@@ -0,0 +1,2 @@
# AWS

2 changes: 1 addition & 1 deletion content/docs/Frameworks/_index.md
@@ -1,3 +1,3 @@
---
-bookFlatSection: true
+bookCollapseSection: true
---
22 changes: 0 additions & 22 deletions content/docs/Frameworks/details.md

This file was deleted.

File renamed without changes.
65 changes: 65 additions & 0 deletions content/docs/Frameworks/spark/create-data-python.md
@@ -0,0 +1,65 @@
+++
title = "Sample data in PySpark"
type = "docs"
bookToC = false
+++

Here's how to create a small fake dataset for testing in PySpark. More on [sc.parallelize](https://spark.apache.org/docs/2.1.1/programming-guide.html#parallelized-collections).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # sc isn't predefined outside the pyspark shell

rdd = sc.parallelize([(0, None), (0, 1), (0, 2), (1, 2), (1, 10), (1, 20), (3, 18), (3, 18), (3, 18)])
df = rdd.toDF(['id', 'score'])
df.show()
```
```
+---+-----+
| id|score|
+---+-----+
| 0| null|
| 0| 1|
| 0| 2|
| 1| 2|
| 1| 10|
| 1| 20|
| 3| 18|
| 3| 18|
| 3| 18|
+---+-----+
```

```
df.printSchema()
root
|-- id: long (nullable = true)
|-- score: long (nullable = true)
```

None is a special keyword in Python that will let you create nullable fields.
If you want to simulate NaN values instead, pass `float('nan')` (or `np.nan`) as the value.
Note that if you don't cast every value in the column to float, the values left as plain ints come back as null, as in the last row below.

```python
from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# np.nan simulates NaN; the last 18 is left as an int, so it comes back as null
rdd = sc.parallelize([(0, np.nan), (0, float(1)), (0, float(2)), (1, float(2)), (1, float(10)), (1, float(20)), (3, float(18)), (3, float(18)), (3, 18)])
df = rdd.toDF(['id', 'score'])
df.show()
```

```
+---+-----+
| id|score|
+---+-----+
| 0| NaN|
| 0| 1.0|
| 0| 2.0|
| 1| 2.0|
| 1| 10.0|
| 1| 20.0|
| 3| 18.0|
| 3| 18.0|
| 3| null|
+---+-----+
```
85 changes: 0 additions & 85 deletions content/docs/Languages/PHP/_index.md

This file was deleted.

60 changes: 0 additions & 60 deletions content/docs/Languages/PHP/arrays.md

This file was deleted.

