Foundation of Big Data Analysis

Data mining algorithms implemented in python (+pyspark)

adding brief explanation for each part

1. Finding potential friends (PySpark)

Discover potential friends who have many mutual friends using a real dataset containing a LiveJournal friends graph.

Algorithm

각 line을 (User, [list of friends])로 map
User와 list of friend를 이용하여 직접 친구 관계인 두 명을 ((user, friend), 0)으로 flat map : real
List of friends에 들어있는 사람들은 해당 User를 mutual friend로 두고 있으므로 List of friend 중에서 둘씩 조합하여 ((friend1, friend2), 1)로 flat map : potential
- 2, 3번 과정에서 first user id가 second user id보다 작도록 정렬
Potential friends이면서 real friends는 아닌 쌍을 찾기 위해 subtract by key 연산
남은 potential friends pair를 reduce by key를 통해 합치고 이때 value를 합산한다. i.e. ((user1, user2), 총 mutual friend의 수 = count)
Count, user1, user2순으로 정렬하여 top-10 pair를 찾고 순서대로 print

2. A-priori: Frequent item

A-priori : an algorithm for finding frequent item. good at avoiding counting many triples or larger sets

Algorithm

pass 1: make two tables
- Table 1: translate item names to integers 1,...,n
- Table 2: array of counts for n items initialized to 0
between pass 1 & 2
- find singletons with support >= *s
- create new numbering 1,...,m for frequent items
- create frequent-items table indexed 1,..,n where ith entry is 0 if item i is not frequent or a unique integer in 1,..,m otherwise
pass 2: find pairs
- for each basket, check which items are frequent
- generate all pairs of frequent items in basket
- for each such pair, increment count in data structure (triangular matrix or triples)

Larger frequent itemsets then pairs can be found by repeating algorithm

3. LSH: Finding similar documents

Find similar items using minhash signatures
Divide signature matrix into b bands of r rows each
For each band, there is hash function on r integers
Use the characteristic that similar columns are more likely to be identical in some band
In this version of implementation, I used 3-shingles to make minhash signatures and b=6, r=20 for LSH algorithm
- Pr(signatures agree in all rows of at least one band and thus become a candidate pair) = 1-(1-s^r)^b

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
A-priori.py		A-priori.py
Collaborative filtering.py		Collaborative filtering.py
DGIM algorithm.py		DGIM algorithm.py
Finding potential friends.py		Finding potential friends.py
Girvan-Newman.py		Girvan-Newman.py
Gradient descent SVM.py		Gradient descent SVM.py
K-means.py		K-means.py
LSH.py		LSH.py
Movie recommendation.py		Movie recommendation.py
PageRank.py		PageRank.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Foundation of Big Data Analysis

1. Finding potential friends (PySpark)

Algorithm

2. A-priori: Frequent item

Algorithm

3. LSH: Finding similar documents

4. K-means algorithm (PySpark)

5. Collaborative filtering

6. PageRank algorithm (PySpark)

7. Girvan-Newman algorithm (PySpark)

8. Gradient descent SVM algorithm

9. DGIM algorithm

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Foundation of Big Data Analysis

1. Finding potential friends (PySpark)

Algorithm

2. A-priori: Frequent item

Algorithm

3. LSH: Finding similar documents

4. K-means algorithm (PySpark)

5. Collaborative filtering

6. PageRank algorithm (PySpark)

7. Girvan-Newman algorithm (PySpark)

8. Gradient descent SVM algorithm

9. DGIM algorithm

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages