Commit dee6518 — "lots of content"

1 parent 1502875

37 files changed: +722 −820 lines

content/_index.md

Lines changed: 5 additions & 3 deletions

@@ -6,11 +6,13 @@ bookToC = false

  # Machine Learning: The Boring Parts

- There are lots of articles out there about the fun, sexy parts of machine learning: neural networks, AI, online learning, billions of parameters. Everyone's having a blast.
+ There are lots of articles out there about the fun, sexy parts of machine learning: neural networks, AI, online learning, billions of parameters.

- This is not that site. This site is one that I always wanted to see: the internal, boring, day-to-day work that machine learning entails. It's mostly a collection of technical notes I've made over the years of things that have been useful to me personally and that I continue to learn on a day-to-day basis.
+ This is not that site. This site is one that I always wanted to see: the internal, boring, day-to-day work that machine learning entails. It's a collection of technical notes I've made over the years of things that have been useful to me personally and that I continue to learn on a day-to-day basis, and in many ways it's the resource I wish I'd had when I started getting into machine learning.

- If you have questions or corrections, [feel free to submit a PR.](https://github.com/veekaybee/boringml)
+ The site and its layout are also constantly evolving as I map different concepts together and make changes and edits, so don't be surprised if content moves around.
+
+ If you have comments or corrections, [feel free to submit a PR.](https://github.com/veekaybee/boringml)

  ## About Me

(new file)

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+ ---
+ bookCollapseSection: true
+ ---
(new file)

Lines changed: 36 additions & 0 deletions

@@ -0,0 +1,36 @@

+++
title = "Hash aggregates"
type = "docs"
bookToC = false
+++

{{< figure src="https://raw.githubusercontent.com/veekaybee/veekaybee.github.io/master/static/images/checkers.png" width="300px" >}}

This post is an expansion of this tweet:

{{< tweet 1280911880157093888 >}}

## Hash Aggregate Here

But data work also has its own unique patterns, and I want to talk about one that I think is important for all of us to carry around in our back pockets: the humble hash aggregate. The hash aggregate [works like this](https://jakevdp.github.io/blog/2017/03/22/group-by-from-scratch/):

{{< figure src="https://raw.githubusercontent.com/veekaybee/veekaybee.github.io/master/static/images/split-apply-combine.png" width="600px" >}}

You have a multidimensional array (or, as we plebes say, a table) that contains many instances of similar labels, and you want an aggregate value for each distinct label. The algorithm splits the rows by key, sums the values within each group, and returns a reduced table with one row per unique key and its summed value.
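The split/apply/combine steps above can be sketched in a few lines of plain Python (toy data, not from the post):

```python
# A hash aggregate from scratch: split rows into buckets keyed by label,
# then combine (sum) the values within each bucket.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def hash_aggregate(pairs):
    groups = {}  # the hash table that gives the pattern its name
    for key, value in pairs:  # split: route each row to its bucket
        groups[key] = groups.get(key, 0) + value  # apply/combine: running sum
    return groups  # one entry per unique key

print(hash_aggregate(rows))  # {'a': 9, 'b': 6}
```

The dictionary does all the work: lookup and insertion are O(1) on average, so the whole aggregation is a single O(n) pass over the rows.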
It's a very simple and ingenious algorithm, and it shows up over and over and over again. If you've ever written a GROUP BY statement in SQL, you've used a hash aggregate. Python's dictionary operations rely on the same idea, and so does Pandas' split-apply-combine (pictured above, from Jake's great post). So does Excel's [Pivot table function](https://en.wikipedia.org/wiki/Pivot_table). So does `sort filename | uniq -c | sort -nr` in Unix. So does the map/reduce pattern that started in Hadoop [and has been implemented in-memory in Spark](https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/). An [inverted index](https://en.wikipedia.org/wiki/Inverted_index), the foundation of Elasticsearch (and many other search and retrieval platforms), is a hash aggregate.
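The Unix one-liner has an in-memory equivalent in Python's standard library, since `collections.Counter` is itself a hash aggregate (hypothetical word list):

```python
from collections import Counter

# equivalent of `sort words.txt | uniq -c | sort -nr`, in memory
words = ["spark", "sql", "spark", "pandas", "spark", "sql"]
counts = Counter(words)
print(counts.most_common())  # [('spark', 3), ('sql', 2), ('pandas', 1)]
```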
## So what?

If you've worked in either development or data for any length of time, it's almost guaranteed that you've come across the need to get unique categories of things and then count the things in those categories. In some cases, you might even need to build your own implementation of GROUP BY because it doesn't exist in your language or framework of choice.

My personal opinion is that every data-centric framework that's been around long enough tends toward SQL, so everything will [eventually implement hash aggregation.](https://docs.confluent.io/5.2.0/ksql/docs/developer-guide/aggregate-streaming-data.html)

Once you understand that hash aggregation is a common pattern, it makes sense to observe it at work, learn how to optimize it, and generally think about it.

Once we know that this pattern has a name and exists, we have a sense of power over our data work. Confucius (or whoever the quote actually belongs to) once said, "The beginning of wisdom is to call things by their proper names," and either he was once a curious toddler, or an MLE looking to better understand the history and context of his architecture.

content/docs/Frameworks/AWS Lambads.md

Lines changed: 0 additions & 50 deletions
This file was deleted.

content/docs/Frameworks/AWS.md

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+ # AWS
+

content/docs/Frameworks/_index.md

Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
  ---
- bookFlatSection: true
+ bookCollapseSection: true
  ---

content/docs/Frameworks/details.md

Lines changed: 0 additions & 22 deletions
This file was deleted.
File renamed without changes.
(new file)

Lines changed: 65 additions & 0 deletions

@@ -0,0 +1,65 @@

+++
title = "Sample data in PySpark"
type = "docs"
bookToC = false
+++

Here's how to create a small fake dataset for testing in PySpark. More on [sc.parallelize](https://spark.apache.org/docs/2.1.1/programming-guide.html#parallelized-collections).

```python
from pyspark.sql.session import SparkSession

# sc is predefined in the PySpark shell; in a standalone script,
# get it from the SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(0, None), (0, 1), (0, 2), (1, 2), (1, 10),
                      (1, 20), (3, 18), (3, 18), (3, 18)])
df = rdd.toDF(['id', 'score'])
df.show()
```
```
+---+-----+
| id|score|
+---+-----+
|  0| null|
|  0|    1|
|  0|    2|
|  1|    2|
|  1|   10|
|  1|   20|
|  3|   18|
|  3|   18|
|  3|   18|
+---+-----+
```
```
df.printSchema()
root
 |-- id: long (nullable = true)
 |-- score: long (nullable = true)
```
None is a special keyword in Python that will let you create nullable fields. If you want to simulate NaN fields, you can do `float('nan')` for the value. Note that if you don't specify each field as a float, you get a null result for the values that are not typed.
```python
import numpy as np
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# the last tuple leaves 18 as a bare int, so it comes back as null
rdd = sc.parallelize([(0, np.nan), (0, float(1)), (0, float(2)),
                      (1, float(2)), (1, float(10)), (1, float(20)),
                      (3, float(18)), (3, float(18)), (3, 18)])
df = rdd.toDF(['id', 'score'])
df.show()
```

```
+---+-----+
| id|score|
+---+-----+
|  0|  NaN|
|  0|  1.0|
|  0|  2.0|
|  1|  2.0|
|  1| 10.0|
|  1| 20.0|
|  3| 18.0|
|  3| 18.0|
|  3| null|
+---+-----+
```

content/docs/Languages/PHP/_index.md

Lines changed: 0 additions & 85 deletions
This file was deleted.
