# Lab 6 Group 9
## Michael Karaman, Alexys Lamkin, Yiyun Zhang

### Exercise 1

In the "Combining Individual Probabilities" section of the "Naive Bayes Spam Filtering" Wikipedia page, the author lists the following equation to use when considering multiple words for classification, assuming independence between each word:

$p = \frac{p_1p_2 \ldots p_N}{p_1p_2 \ldots p_N + (1-p_1)(1-p_2) \ldots (1-p_N)}$

where:
*   $p$ is the probability that an email is spam
*   $p_n$ (for $n = 1, \ldots, N$) is the probability $P(S|W_n)$, i.e., the probability that an email is spam given that it contains word $n$ (where $N$ is the total number of words in the email)

Not only is the author's description poorly worded, but the equation listed is also incorrect. When considering multiple independent events, the conditional probabilities are multiplied such that:

$P(W_1, W_2, \ldots, W_N|S) \approx P(W_1|S) \times P(W_2|S) \times \ldots \times P(W_N|S)$

Using the correct conditional probability for independent events, the author's equation should have been:

$P(S|W_1, W_2, \ldots, W_N) = \frac{P(W_1, W_2, \ldots, W_N|S)P(S)}{P(W_1, W_2, \ldots, W_N|S)P(S) + P(W_1, W_2, \ldots, W_N|H)P(H)}$
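As a rough numeric illustration, the posterior can be computed by multiplying the per-word likelihoods under spam ($S$) and under ham ($H$) and then applying Bayes' rule. This is only a sketch: the function name `spam_posterior` and the likelihood values are made up for illustration and do not come from the Wikipedia page.

```python
from math import prod

def spam_posterior(p_w_given_s, p_w_given_h, p_s=0.5):
    """P(S | W_1..W_N), assuming the words are independent given the class."""
    p_h = 1.0 - p_s
    like_s = prod(p_w_given_s)  # P(W_1, ..., W_N | S)
    like_h = prod(p_w_given_h)  # P(W_1, ..., W_N | H)
    return like_s * p_s / (like_s * p_s + like_h * p_h)

# Hypothetical likelihoods for two words (illustrative values only)
posterior = spam_posterior([0.6, 0.4], [0.1, 0.2], p_s=0.5)
print(round(posterior, 3))
```

Note that the ham term in the denominator uses the likelihoods given $H$, not given $S$.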




### Exercise 2
$P(S|B,C)=\frac{P(B,C|S)P(S)}{P(B,C|S)P(S)+P(B,C|H)P(H)}=\frac{\frac{12}{1}}{\frac{12}{1}+\frac{2}{3}}\approx0.95$ \
This is larger than both $P(S|B)=80\%$ and $P(S|B,C,W)=90\%$, which means that emails containing both BUY and CHEAP are more likely to be spam.
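The arithmetic can be double-checked with exact fractions. The sketch below only re-evaluates the two terms that appear in the expression above ($\frac{12}{1}$ and $\frac{2}{3}$); the variable names are ours.

```python
from fractions import Fraction

spam_term = Fraction(12, 1)  # numerator term from the expression above
ham_term = Fraction(2, 3)    # second denominator term from the expression above

posterior = spam_term / (spam_term + ham_term)
print(posterior, round(float(posterior), 3))
```

The exact value is $\frac{18}{19}$, which rounds to the $0.95$ reported above.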

### Exercise 3

Classification problems can be generalized as using a categorization function $c(x)$ to assign an instance $x$ (which exists in the domain $X$) to a category within the range $C$. Examples of category sets $C$ include spam labels and topics, and examples of categories within such sets include "spam" and "ham" or "finance" and "sports", respectively.

Bayes' theorem provides a method for solving classification problems. It uses a prior probability for each category, given no information about the instance, and produces a posterior probability distribution over all possible categories given the instance.

The maximum a posteriori (MAP) hypothesis can be used to simplify the Bayesian calculation: because $P(x)$ is the same for every category, it does not need to be calculated, and as a result the final scores do not need to be normalized in order to pick the most probable category.
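Written out, the simplification looks as follows. The label $c_{MAP}$ for the selected category is our own notation and may differ from the slides:

$c_{MAP} = \arg\max_{c \in C} P(c|x) = \arg\max_{c \in C} \frac{P(x|c)P(c)}{P(x)} = \arg\max_{c \in C} P(x|c)P(c)$

The last step holds because $P(x)$ does not depend on $c$, so dropping it does not change which category attains the maximum.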

The Naive Bayes classifier can be derived in two steps:
1.   Assume that an instance $x$ can be described by an n-dimensional vector of attributes
2.   Assume that the attributes are independent of each other given the category (i.e., the conditional independence assumption)
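
Taken together, the two steps give the following form. Here $x = (x_1, \ldots, x_n)$ and the label $c_{NB}$ are assumed notation, not taken from the slides:

$P(x_1, \ldots, x_n|c) = \prod_{i=1}^{n} P(x_i|c)$

$c_{NB} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i|c)$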

The slides provide an example of a classification problem in which we try to determine whether someone has the flu given specific symptoms (e.g., runny nose, cough). When using maximum likelihood estimates for Naive Bayes, an issue arises: a single estimated probability of zero overwhelms all other evidence, leading to poor classification. To help mitigate the effects of zero probabilities, Laplace smoothing can be used.
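
A minimal sketch of Laplace (add-one) smoothing for a single attribute is shown below. The function name `laplace_estimate` and the counts are hypothetical, chosen only to show how a zero count stops producing a zero probability.

```python
def laplace_estimate(count, total, num_values, alpha=1.0):
    """Smoothed estimate of P(attribute = value | category).

    count: training examples in the category with this attribute value
    total: training examples in the category
    num_values: number of distinct values the attribute can take
    """
    return (count + alpha) / (total + alpha * num_values)

# Hypothetical counts: 0 of 10 flu patients in the training data had a cough,
# and the attribute has 2 possible values (present / absent).
unsmoothed = 0 / 10
smoothed = laplace_estimate(count=0, total=10, num_values=2)
print(unsmoothed, smoothed)
```

With the unsmoothed estimate, any patient with a cough would get a flu score of exactly zero regardless of the other symptoms; the smoothed estimate ($\frac{1}{12}$ here) stays small but nonzero.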

The lecture also discusses a Naive Bayes classifier for multinomial variables. In this classifier, the order of words doesn't matter when calculating the probability. When training the classifier, the vocabulary needs to be extracted from the training corpus and the probability of each unique word needs to be calculated for each category. Again, to lessen the effects of zero probabilities, Laplace smoothing is applied.
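
The training procedure described above can be sketched in plain Python. This is an illustrative sketch under simple assumptions (whitespace tokenization, lowercasing, add-one smoothing); the function name `train_multinomial_nb` and the tiny corpus are made up and are not from the lecture.

```python
from collections import Counter

def train_multinomial_nb(docs, alpha=1.0):
    """docs: list of (category, text) pairs. Returns (priors, word_probs, vocab)."""
    # Extract the vocabulary from the whole training corpus
    vocab = {w for _, text in docs for w in text.lower().split()}

    # Count documents and word occurrences per category (word order is ignored)
    doc_counts = Counter(cat for cat, _ in docs)
    word_counts = {cat: Counter() for cat in doc_counts}
    for cat, text in docs:
        word_counts[cat].update(text.lower().split())

    priors = {cat: n / len(docs) for cat, n in doc_counts.items()}

    # Laplace-smoothed P(word | category) for every word in the vocabulary
    word_probs = {}
    for cat, counts in word_counts.items():
        total = sum(counts.values())
        word_probs[cat] = {
            w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab
        }
    return priors, word_probs, vocab

# Tiny made-up corpus, for illustration only
docs = [
    ("spam", "buy cheap pills"),
    ("spam", "cheap cheap offer"),
    ("ham", "meeting at noon"),
]
priors, word_probs, vocab = train_multinomial_nb(docs)
print(priors["spam"], word_probs["spam"]["cheap"], word_probs["ham"]["cheap"])
```

Because of smoothing, "cheap" still has a nonzero probability under "ham" even though it never appears in a ham document.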

Naive Bayes classifiers are known to be dependable, fast, and optimal if the independence assumption holds. They are also known to have low storage requirements.

To prevent underflow when calculating Naive Bayes probabilities, one can perform all computations by summing the logs of the probabilities instead of multiplying the probabilities.
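
The effect is easy to reproduce. In the sketch below, the per-word probabilities are hypothetical values chosen to be small enough that their product underflows in double precision, while the sum of their logs stays representable.

```python
import math

# Hypothetical per-word probabilities: 100 words, each with probability 1e-5
probs = [1e-5] * 100

# Multiplying directly underflows to 0.0 (the true value is 1e-500)
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the score finite and still comparable across categories
log_score = sum(math.log(p) for p in probs)

print(product)
print(log_score)
```

Since the logarithm is monotonically increasing, the category with the largest log score is also the category with the largest probability.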


### Exercise 4

The examples below are from: https://www.exxactcorp.com/blog/Deep-Learning/the-benefits-examples-of-using-apache-spark-with-pyspark

In [11]:
#installing pyspark
!pip install -q pyspark

In [20]:
from pyspark import SparkContext
import numpy as np

#create a SparkContext that runs locally on 4 cores
sc=SparkContext(master="local[4]")

#create an array of 20 random integers between 0 and 9
lst=np.random.randint(0,10,20)

#create an RDD
A=sc.parallelize(lst)

#verifying the # of distributed partitions (should match # of cores)
A.glom().collect()

[[4, 0, 5, 8, 6], [2, 3, 1, 1, 4], [7, 5, 0, 2, 0], [7, 3, 5, 0, 4]]

In [21]:
#example: find the longest word
words = 'These are some of the best Macintosh computers ever'.split(' ')
wordRDD = sc.parallelize(words)
wordRDD.reduce(lambda w,v: w if len(w)>len(v) else v)

'computers'

In [23]:
#example: compute squares
squares=A.map(lambda x:x*x)
squares.collect()

[16, 0, 25, 64, 36, 4, 9, 1, 1, 16, 49, 25, 0, 4, 0, 49, 9, 25, 0, 16]

In [24]:
#stop spark session
sc.stop()

The examples below are from: https://spark.apache.org/docs/latest/api/python/getting_started/quickstart.html

In [26]:
from pyspark.sql import SparkSession

#initiate PySpark session
spark = SparkSession.builder.getOrCreate()

In [32]:
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

#creating a PySpark dataframe from a pandas dataframe
pandas_df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [2., 3., 4.],
    'c': ['string1', 'string2', 'string3'],
    'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
    'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
})

df = spark.createDataFrame(pandas_df)

#viewing the dataframe
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
df

a,b,c,d,e
1,2.0,string1,2000-01-01,2000-01-01 12:00:00
2,3.0,string2,2000-02-01,2000-01-02 12:00:00
3,4.0,string3,2000-03-01,2000-01-03 12:00:00


In [33]:
#showing the summary of the dataframe
df.select("a", "b", "c").describe().show()

+-------+---+---+-------+
|summary|  a|  b|      c|
+-------+---+---+-------+
|  count|  3|  3|      3|
|   mean|2.0|3.0|   null|
| stddev|1.0|1.0|   null|
|    min|  1|2.0|string1|
|    max|  3|4.0|string3|
+-------+---+---+-------+



In [34]:
#collect the distributed data
df.collect()

[Row(a=1, b=2.0, c='string1', d=datetime.date(2000, 1, 1), e=datetime.datetime(2000, 1, 1, 12, 0)),
 Row(a=2, b=3.0, c='string2', d=datetime.date(2000, 2, 1), e=datetime.datetime(2000, 1, 2, 12, 0)),
 Row(a=3, b=4.0, c='string3', d=datetime.date(2000, 3, 1), e=datetime.datetime(2000, 1, 3, 12, 0))]

In [37]:
#filtering dataframe (keep rows where a == 3) with mapInPandas

def pandas_filter_func(iterator):
    for pandas_df in iterator:
        yield pandas_df[pandas_df.a == 3]

df.mapInPandas(pandas_filter_func, schema=df.schema).show()

+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+



In [36]:
#grouping data
df2 = spark.createDataFrame([
    ['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
    ['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
    ['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])

df2.groupby('color').avg().show()

+-----+-------+-------+
|color|avg(v1)|avg(v2)|
+-----+-------+-------+
|  red|    4.8|   48.0|
|black|    6.0|   60.0|
| blue|    3.0|   30.0|
+-----+-------+-------+

