# Homework

Consider a binary classification problem with the following two classes:
- 1 for 'American_movie_actors'
- 0 for 'American_stage_actors'

Hashing can be used instead of a dictionary.

Check how the model behaves after switching to hashing and answer the following questions:
1. **What roc_auc_score is obtained on the test set when using a dictionary?**
2. **What roc_auc_score is obtained on the test set after switching from the dictionary to hashing?**

Details:
1. Split the data into training and test sets by the parity of the article `id`: even ids for training, odd ids for testing. Compute gradients on the training set only!
2. To calculate roc_auc_score, collect the predictions and true labels for the examples in the test set. All (prediction, label) pairs fit into memory, so take advantage of that!
3. Use `murmurhash3_32(x) % 2**20` as the hash function.
4. Fix the random seed when initializing the weights: `np.random.seed(0); weights = np.random.random(...)`
5. Train for 500 epochs with a step size of 0.3. After each epoch, call `weights_broadcast.destroy()` to release the broadcast variable so that you don't run out of memory.
6. This is how roc_auc_score on the test set depends on the number of epochs (the higher the roc_auc_score, the better):
<img src="images/test_auc.png" width="600px"></img>
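
The parity split from item 1 can be sketched in plain Python. This is only an illustration: the `articles` records and the helper name `split_by_parity` are made up here, and the notebook itself performs the split in Spark SQL.

```python
# Hypothetical toy records standing in for wiki articles.
articles = [
    {"id": 1, "title": "April"},
    {"id": 2, "title": "August"},
    {"id": 5230, "title": "Gwyneth Paltrow"},
    {"id": 5693, "title": "Example"},
]

def split_by_parity(rows):
    """Even ids go to the training part, odd ids to the test part."""
    train = [row for row in rows if row["id"] % 2 == 0]
    test = [row for row in rows if row["id"] % 2 != 0]
    return train, test

train_part, test_part = split_by_parity(articles)
train_ids = [row["id"] for row in train_part]
test_ids = [row["id"] for row in test_part]
```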

Save the solution to the `result.json` file.
Example file content:

```json
{
    "q1": 0.123,
    "q2": 0.456
}
```

In [37]:
from sklearn.utils import murmurhash3_32 as mur

In [2]:
from sklearn.metrics import roc_auc_score

# y_true - real classes
# y_score - class 1 probabilities
# https://en.wikipedia.org/wiki/Receiver_operating_characteristic
roc_auc_score(y_true=[1, 1, 0, 0], y_score=[0.8, 0.7, 0.3, 0.2])

1.0
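
To make the metric less of a black box, ROC AUC can be read as the share of (positive, negative) pairs in which the positive example gets the higher score. The sketch below computes it that way in pure Python; `pairwise_auc` is a hypothetical helper written for illustration, it is quadratic in the number of examples, and it is not how scikit-learn implements `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Same inputs as the roc_auc_score call above: every positive outranks every negative.
perfect = pairwise_auc([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.2])
# One of the four pairs is ranked the wrong way round.
mixed = pairwise_auc([1, 0, 1, 0], [0.8, 0.7, 0.3, 0.2])
```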

## Step 1: Dataset preparation for a classification model

In [3]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName='jupyter')

from pyspark.sql import SparkSession, Row
se = SparkSession(sc)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2023-03-30 17:23:04,863 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [4]:
# ! hadoop fs -copyFromLocal wiki /

In [5]:
wiki = se.read.json("hdfs:///wiki/wiki.jsonl")
wiki.createOrReplaceTempView("wiki")  # registerTempTable is deprecated since Spark 2.0
wiki.limit(2).toPandas()

Unnamed: 0,id,text,title,url
0,1,April\n\nApril is the fourth month of the year...,April,https://simple.wikipedia.org/wiki?curid=1
1,2,August\n\nAugust (Aug.) is the eighth month of...,August,https://simple.wikipedia.org/wiki?curid=2


In [6]:
categories = se.read.json("hdfs:///wiki/categories.jsonl")
categories.createOrReplaceTempView("categories")  # registerTempTable is deprecated since Spark 2.0
categories.limit(2).toPandas()

Unnamed: 0,category,page_id
0,Months,1
1,Months,2


In [7]:
df = se.sql("""
select
    id,
    text,
    title,
    url,
    cast(categories.category == 'American_movie_actors' as int) as target
from
    wiki join categories on wiki.id == categories.page_id
where categories.category in ('American_movie_actors', 'American_stage_actors')
""")
df.limit(5).toPandas()

Unnamed: 0,id,text,title,url,target
0,5692,Natalie Portman\n\nNatalie Portman (born Neta-...,Natalie Portman,https://simple.wikipedia.org/wiki?curid=5692,0
1,5692,Natalie Portman\n\nNatalie Portman (born Neta-...,Natalie Portman,https://simple.wikipedia.org/wiki?curid=5692,1
2,5240,"Elizabeth Taylor\n\nDame Elizabeth ""Liz"" Rosem...",Elizabeth Taylor,https://simple.wikipedia.org/wiki?curid=5240,0
3,5240,"Elizabeth Taylor\n\nDame Elizabeth ""Liz"" Rosem...",Elizabeth Taylor,https://simple.wikipedia.org/wiki?curid=5240,1
4,5230,Gwyneth Paltrow\n\nGwyneth Kate Paltrow (born ...,Gwyneth Paltrow,https://simple.wikipedia.org/wiki?curid=5230,0


In [8]:
# df.write.mode('overwrite').json("/actors.jsonl")

In [9]:
df_train = se.sql("""
select
    text,
    cast(categories.category == 'American_movie_actors' as int) as target
from
    wiki join categories on wiki.id == categories.page_id
where categories.category in ('American_movie_actors', 'American_stage_actors')
and id % 2 = 0
""")
df_train.limit(5).toPandas()

Unnamed: 0,text,target
0,Natalie Portman\n\nNatalie Portman (born Neta-...,0
1,Natalie Portman\n\nNatalie Portman (born Neta-...,1
2,"Elizabeth Taylor\n\nDame Elizabeth ""Liz"" Rosem...",0
3,"Elizabeth Taylor\n\nDame Elizabeth ""Liz"" Rosem...",1
4,Gwyneth Paltrow\n\nGwyneth Kate Paltrow (born ...,0


In [10]:
df_test = se.sql("""
select
    text,
    cast(categories.category == 'American_movie_actors' as int) as target
from
    wiki join categories on wiki.id == categories.page_id
where categories.category in ('American_movie_actors', 'American_stage_actors')
and id % 2 <> 0
""")
df_test.limit(5).toPandas()

Unnamed: 0,text,target
0,Britney Spears\n\nBritney Jean Spears (born De...,1
1,Angelina Jolie\n\nAngelina Jolie (; née Voight...,1
2,"Brad Pitt\n\nWilliam Bradley ""Brad"" Pitt (born...",1
3,"Nicole Kidman\n\nNicole Mary Kidman, AC (born ...",1
4,Christina Ricci\n\nChristina Ricci (born Febru...,1


## Step 2: Build a word index ordered by document frequency

In [11]:
import re
import string

def tokenize(text):
    text = re.sub(f'[^{re.escape(string.printable)}]', ' ', text)  # replace unprintable characters with a space
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)  # and punctuation
    words = text.lower().split()
    return words
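
A quick check of what the tokenizer produces can be useful before running it on the cluster. The snippet below repeats the same two substitutions in a self-contained form; the sample strings are made up for illustration. One side effect worth knowing: characters outside `string.printable`, such as `é`, are replaced by a space, so accented words are split into pieces.

```python
import re
import string

def tokenize(text):
    # Same steps as the cell above: drop non-printable characters, then punctuation.
    text = re.sub(f'[^{re.escape(string.printable)}]', ' ', text)
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    return text.lower().split()

tokens = tokenize('William Bradley "Brad" Pitt (born 1963) is an actor.')
accented = tokenize('née Voight')
```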

In [12]:
import json

def mapper(line):
    text = json.loads(line)['text']
    words = tokenize(text)
    return [(word, 1) for word in set(words)]

In [13]:
%%time
word_counts = (
    sc.textFile("hdfs:///actors.jsonl")
    .flatMap(mapper)
    .reduceByKey(lambda a, b: a + b)
    .collect()
)



CPU times: user 38.4 ms, sys: 8.11 ms, total: 46.6 ms
Wall time: 5.77 s

In [14]:
len(word_counts)

23833

In [15]:
top_word_counts = sorted(word_counts, key=lambda x: -x[1])[:10000]
top_word_counts[:5]

[('american', 5519), ('in', 5381), ('an', 5315), ('and', 5152), ('the', 5119)]

In [16]:
top_word_counts[-5:]

[('cicely', 3), ('hahn', 3), ('feeney', 3), ('jigsaw', 3), ('portlandia', 3)]

In [17]:
# indexes are needed for vectorization of texts
word_to_index = {word: index for index, (word, count) in enumerate(top_word_counts)}

In [18]:
list(word_to_index.items())[:5]

[('american', 0), ('in', 1), ('an', 2), ('and', 3), ('the', 4)]
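
The `flatMap` / `reduceByKey` pipeline above counts, for each word, the number of articles that contain it, and the index then keeps the most frequent words. A minimal single-machine sketch of the same idea follows; the toy `docs` corpus and `top_k = 3` are assumptions for illustration, and the alphabetical tie-break is added only to keep the toy output deterministic (the notebook does not do this).

```python
from collections import Counter

# Hypothetical toy corpus; each string stands in for one article text.
docs = [
    "american actor and american singer",
    "american stage actor",
    "british actor",
]

# Document frequency: each word is counted at most once per document,
# which is what set(words) achieves in the mapper above.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

# Keep the top_k most frequent words and number them from 0.
top_k = 3
ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
word_to_index = {word: index for index, (word, _) in enumerate(ranked)}
```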

## Step 3: Prepare the training dataset "df_train" for model training

In [19]:
from collections import Counter
import numpy as np

In [20]:
# # first option: the word_to_index dictionary will be serialized using pickle along with the function
# import numpy as np

# def mapper(line):
#     j = json.loads(line)
#     text = j['text']
#     words = tokenize(text)
#     indices = []
#     values = []
#     for word, count in Counter(words).items():
#         if word in word_to_index:
#             index = word_to_index[word]
#             indices.append(index)
#             tf = count / float(len(words))
#             values.append(tf)
#     return np.array(indices), np.array(values)

In [21]:
# # %%time
# (
#     sc.textFile("hdfs:///actors.jsonl")
#     .map(mapper)
#     .take(1)
# )

In [22]:
def mapper(row):
    words = tokenize(row.text)
    indices = []
    values = []
    for word, count in Counter(words).items():
        if word in word_to_index:
            index = word_to_index[word]
            indices.append(index)
            tf = count / float(len(words))
            values.append(tf)
    return np.array(indices), np.array(values), row.target
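
The mapper turns each text into a sparse term-frequency vector: a list of feature indices and a parallel list of values. The sketch below shows the same computation on plain lists; the three-word `vocab` and the helper name `to_sparse_tf` are hypothetical. As in the mapper, the denominator counts every token, including words that are not in the vocabulary.

```python
from collections import Counter

# Hypothetical three-word vocabulary; the notebook builds its own in Step 2.
vocab = {"actor": 0, "american": 1, "and": 2}

def to_sparse_tf(words, vocab):
    """Return parallel lists (indices, values) of term frequencies."""
    indices, values = [], []
    total = len(words)
    for word, count in Counter(words).items():
        if word in vocab:
            indices.append(vocab[word])
            values.append(count / total)
    return indices, values

# "singer" is out of vocabulary, so it only contributes to the denominator.
indices, values = to_sparse_tf(["american", "actor", "american", "singer"], vocab)
```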

In [23]:
dataset = df_train.rdd.map(mapper)
dataset.cache()  # cache dataset in RAM
dataset.count()

2814

In [24]:
dataset.take(1)

[(array([2135, 8313,    5,  372, 5017, 5072, 5031, 2916, 4059, 5269,   69,
          166,  288,    8,    2, 4251,    0,   18,   21,    6,    1, 8015,
         3445,  394,  217,  397,  407,  345,  443,    3,  482,  218, 1375,
           19,  309,  554,   35,   40,   37,   29,  274,  194,  597,   22,
          135,  100,  560, 1947,   95, 4265, 4844,  145,   49,    4,   83,
          853,  105,  130,  106,    7,   27,   96,   17,  421, 4158,  126,
          181,  202,  134, 2291,  489,  151,  376,  599,   15,  262, 6592,
          846,   36,   25,   48,   10,  535,  179,  222]),
  array([0.01709402, 0.03418803, 0.01709402, 0.00854701, 0.01709402,
         0.00854701, 0.00854701, 0.00854701, 0.00854701, 0.00854701,
         0.00854701, 0.00854701, 0.00854701, 0.02564103, 0.01709402,
         0.00854701, 0.00854701, 0.01709402, 0.05128205, 0.00854701,
         0.05982906, 0.00854701, 0.00854701, 0.00854701, 0.00854701,
         0.00854701, 0.00854701, 0.00854701, 0.00854701, 0.02564103,
  

## Step 4: Prepare auxiliary functions and train the model

In [25]:
def sigmoid(x):
    if x >= 0:
        return 1. / (1. + np.exp(-x))
    else:
        return np.exp(x) / (1. + np.exp(x))
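
The two-branch form exists so that `exp` is only ever applied to a non-positive number. The sketch below illustrates this with `math.exp` from the standard library rather than `np.exp`, to stay dependency-free; `naive_sigmoid` and `stable_sigmoid` are hypothetical names. With `math.exp` the naive form raises `OverflowError` for a large negative input, whereas NumPy would typically return `inf` and emit a runtime warning instead.

```python
import math

def naive_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stable_sigmoid(x):
    # Only exponentiate a non-positive number, so exp() cannot overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

try:
    naive_sigmoid(-1000.0)  # needs exp(1000), which is out of float range
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_value = stable_sigmoid(-1000.0)  # exp(-1000) underflows to 0.0
```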

In [26]:
def compute_gradient(weights_broadcast, loss, examples):
    # here we accumulate the contribution to the gradient
    gradient = np.zeros(len(weights_broadcast.value))
    
    for example in examples:
        indices, values, target = example

        # make a prediction with the current weights
        p = sigmoid(values.dot(weights_broadcast.value[indices]))

        # add to gradient accumulator
        gradient[indices] += values * (p - target)

        # count losses
        p = np.clip(p, 1e-15, 1-1e-15)
        loss.add(-(target * np.log(p) + (1 - target) * np.log(1 - p)))
    
    yield gradient
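
For reference, the quantities accumulated in `compute_gradient` can be written out as follows, with $x_i$ the sparse TF vector of example $i$, $y_i$ its target, $w$ the weights and $\sigma$ the sigmoid. The notation is introduced here for illustration. Note that the code sums the per-example gradients rather than averaging them, while the printed loss is divided by `N`.

```latex
p_i = \sigma(w^\top x_i), \qquad
\ell_i(w) = -\bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr], \qquad
g(w) = \sum_i (p_i - y_i)\, x_i
```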

In [27]:
# number of examples
N = dataset.count()
N

2814

In [28]:
from functools import partial
import numpy as np

# random weights
np.random.seed(0)
weights = np.random.random(len(word_to_index))

# Gradient Descent Epoch
for i in range(500):
    weights_broadcast = sc.broadcast(weights)
    loss = sc.accumulator(0.0)
    
    # calculate the gradient
    gradient = (
        dataset
        .coalesce(2)  # merge 200 cached partitions into 2
        .mapPartitions(partial(compute_gradient, weights_broadcast, loss))
        .reduce(lambda a, b: a + b)
    )

    # update the weights (step size 0.4 here; the assignment text asks for 0.3)
    weights -= 0.4 * gradient
    
    weights_broadcast.destroy()
    
    # print("epoch:", i, "loss:", loss.value / N)
print("epoch:", i, "loss:", loss.value / N)

epoch: 499 loss: 0.4831161902131253
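
The same full-batch update can be tried on a single machine, which may help when debugging the Spark version. This is a minimal sketch under stated assumptions: the four toy `examples`, the zero initial weights, the 50 epochs and the helper name `epoch` are all made up for illustration, and plain lists replace NumPy arrays and broadcast variables.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Hypothetical toy data: (indices, values, target) triples over 3 features.
examples = [
    ([0, 1], [0.5, 0.5], 1),
    ([0, 2], [0.5, 0.5], 0),
    ([1], [1.0], 1),
    ([2], [1.0], 0),
]

def epoch(weights, examples, step):
    """One full-batch update; returns the new weights and the mean log loss."""
    gradient = [0.0] * len(weights)
    loss = 0.0
    for indices, values, target in examples:
        p = sigmoid(sum(v * weights[i] for i, v in zip(indices, values)))
        for i, v in zip(indices, values):
            gradient[i] += v * (p - target)
        p = min(max(p, 1e-15), 1 - 1e-15)  # avoid log(0)
        loss -= target * math.log(p) + (1 - target) * math.log(1 - p)
    new_weights = [w - step * g for w, g in zip(weights, gradient)]
    return new_weights, loss / len(examples)

weights = [0.0, 0.0, 0.0]
losses = []
for _ in range(50):
    weights, mean_loss = epoch(weights, examples, step=0.3)
    losses.append(mean_loss)
```

On this toy data the loss starts at `log(2)` (all predictions are 0.5) and goes down as feature 1 gets a positive weight and feature 2 a negative one.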


In [29]:
weights, len(weights)

(array([ 8.59426268, -0.81586375, 14.76037071, ...,  0.75842952,
         0.87954432,  0.81357508]),
 10000)

## Step 5: Calculate "roc_auc_score" for test dataset "df_test"

In [30]:
dataset = df_test.rdd.map(mapper)
dataset.cache()  # cache dataset in RAM
dataset.count()

2785

In [31]:
def rocAuc(row):
    
    indices, values, target = row
    # make a prediction with the current weights
    p = sigmoid(values.dot(weights[indices]))
    
    return p, target

In [32]:
rocAuc_dataset = dataset.map(rocAuc)
rocAuc_dataset.take(3)

[(0.891036899042322, 1), (0.8581572602570084, 1), (0.7583840538462286, 1)]

In [33]:
y_score = rocAuc_dataset.keys().collect()

In [34]:
y_true = rocAuc_dataset.values().collect()

In [35]:
roc_auc_score(y_true, y_score)

0.6857984028894291

## Step 6: Build a word index ordered by document frequency, using hashing

In [39]:
def tokenize(text):
    text = re.sub(f'[^{re.escape(string.printable)}]', ' ', text)  # replace unprintable characters with a space
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)  # and punctuation
    words = [mur(word) % 2**20 for word in text.lower().split()]  # hash each token into one of 2**20 buckets
    return words

def mapper(line):
    text = json.loads(line)['text']
    words = tokenize(text)
    return [(word, 1) for word in set(words)]

In [40]:
%%time
word_counts = (
    sc.textFile("hdfs:///actors.jsonl")
    .flatMap(mapper)
    .reduceByKey(lambda a, b: a + b)
    .collect()
)



CPU times: user 24 ms, sys: 13.6 ms, total: 37.7 ms
Wall time: 5.04 s

In [41]:
len(word_counts)

23572

In [42]:
top_word_counts = sorted(word_counts, key=lambda x: -x[1])[:10000]
top_word_counts[:5]

[(510525, 5519),
 (828689, 5381),
 (715111, 5315),
 (868051, 5152),
 (761698, 5119)]

In [43]:
top_word_counts[-5:]

[(174814, 3), (257118, 3), (820126, 3), (516702, 3), (216318, 3)]

In [44]:
# indexes are needed for vectorization of texts
word_to_index = {word: index for index, (word, count) in enumerate(top_word_counts)}

In [45]:
list(word_to_index.items())[:5]

[(510525, 0), (828689, 1), (715111, 2), (868051, 3), (761698, 4)]
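
The cells above still build a dictionary, only keyed by hashed tokens. An alternative form of the hashing trick, which this notebook does not use, takes the bucket number itself as the feature index, so no vocabulary pass is needed and the weight vector simply has `2**20` entries. A minimal sketch, assuming `zlib.crc32` as a stand-in hash so that the snippet runs without scikit-learn; the assignment itself specifies `murmurhash3_32(x) % 2**20`, and the names `bucket` and `hashed_tf` are hypothetical.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 20

def bucket(token):
    # crc32 is only a stand-in here; the assignment specifies murmurhash3_32.
    return zlib.crc32(token.encode("utf-8")) % NUM_BUCKETS

def hashed_tf(words):
    """Sparse TF vector keyed by hash bucket; colliding words share a slot."""
    total = len(words)
    counts = Counter(bucket(word) for word in words)
    indices = sorted(counts)
    values = [counts[i] / total for i in indices]
    return indices, values

indices, values = hashed_tf(["american", "actor", "american", "singer"])
```

The trade-off is that distinct words can collide in one bucket, in exchange for not having to build or ship a vocabulary.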

## Step 7: Prepare the training dataset "df_train" to train the model using hashing

In [46]:
def mapper(row):
    words = tokenize(row.text)
    indices = []
    values = []
    for word, count in Counter(words).items():
        if word in word_to_index:
            index = word_to_index[word]
            indices.append(index)
            tf = count / float(len(words))
            values.append(tf)
    return np.array(indices), np.array(values), row.target

In [47]:
dataset = df_train.rdd.map(mapper)
dataset.cache()  # cache dataset in RAM
dataset.count()

2814

In [48]:
dataset.take(1)

[(array([2091, 8877,    5,  375, 4901, 5024, 4841, 2903, 3936, 5200,   69,
          166,  288,    8,    2, 4348,    0,   18,   21,    6,    1, 7010,
         3462,  394,  217,  398,  407,  347,  448,    3,  483,  218, 1409,
           19,  310,  557,   35,   40,   37,   29,  274,  191,  599,   22,
          135,  100,  565, 1894,   95, 4281, 4864,  145,   49,    4,   83,
          861,  105,  130,  107,    7,   27,   96,   17,  421, 4414,  126,
          181,  202,  134, 2278,  490,  149,  376,  597,   15,  262, 6768,
          843,   36,   25,   48,   10,  536,  179,  222]),
  array([0.01709402, 0.03418803, 0.01709402, 0.00854701, 0.01709402,
         0.00854701, 0.00854701, 0.00854701, 0.00854701, 0.00854701,
         0.00854701, 0.00854701, 0.00854701, 0.02564103, 0.01709402,
         0.00854701, 0.00854701, 0.01709402, 0.05128205, 0.00854701,
         0.05982906, 0.00854701, 0.00854701, 0.00854701, 0.00854701,
         0.00854701, 0.00854701, 0.00854701, 0.00854701, 0.02564103,
  

In [49]:
# number of examples
N = dataset.count()
N

2814

In [50]:
# random weights
np.random.seed(0)
weights = np.random.random(len(word_to_index))

# Gradient Descent Epoch
for i in range(500):
    weights_broadcast = sc.broadcast(weights)
    loss = sc.accumulator(0.0)
    
    # calculate the gradient
    gradient = (
        dataset
        .coalesce(2)  # merge 200 cached partitions into 2
        .mapPartitions(partial(compute_gradient, weights_broadcast, loss))
        .reduce(lambda a, b: a + b)
    )

    # update the weights (step size 0.4 here; the assignment text asks for 0.3)
    weights -= 0.4 * gradient
    
    weights_broadcast.destroy()
    
    # print("epoch:", i, "loss:", loss.value / N)
print("epoch:", i, "loss:", loss.value / N)

epoch: 499 loss: 0.4830229825164852


In [51]:
weights, len(weights)

(array([ 8.34111336, -0.87491418, 14.74509065, ...,  1.35699863,
         0.02378743,  1.19216271]),
 10000)

## Step 8: Calculate "roc_auc_score" for test dataset "df_test" using hashing

In [53]:
dataset = df_test.rdd.map(mapper)
dataset.cache()  # cache dataset in RAM
dataset.count()

2785

In [54]:
rocAuc_dataset = dataset.map(rocAuc)
rocAuc_dataset.take(3)

[(0.8880199106275367, 1), (0.859675802580131, 1), (0.7565031406935169, 1)]

In [55]:
y_score = rocAuc_dataset.keys().collect()

In [56]:
y_true = rocAuc_dataset.values().collect()

In [57]:
roc_auc_score(y_true, y_score)

0.6851919936196855

In [69]:
# stop Spark (and YARN application)
# sc.stop()

In [4]:
data = { "q1": 0.57, "q2": 0.6851919936196855 }
json.dumps(data)

'{"q1": 0.57, "q2": 0.6851919936196855}'

In [5]:
with open('result.json', 'w') as f:
    f.write(json.dumps(data))

In [6]:
! curl -F file=@result.json "51.250.54.133:80/MDS-LSML1/aizmalkin/w3/1"

1.0
Correct q1 answer! Correct q2 answer!
