# Homework 3 - Frequent pattern mining

## Deadline: October 5th, at noon (12:00)

#### General instructions

Please look up the general instructions about homeworks in the first homework.

#### Tracking your time

We would appreciate it if you tracked the time you spend on solving the homework and reported it in the dedicated cells at the end of the homework. This is not compulsory and does not affect your grade in any way. The collected information will be used to improve future homeworks.

#### About this homework

In this homework, we will get familiar with mining frequent itemsets, using the apriori algorithm, and calculating metrics for association rules. The Titanic dataset is included in the IDS2020_HW03.zip container. Before solving these tasks please make sure that you have looked through the slides of Lecture 04, in particular the parts about support, support count, and association rules and their metrics.

# 1.  Basics (1 point)

Searching for frequent patterns in the data is one of the most basic procedures used in descriptive data analysis (data mining). The goal of this exercise is to study the frequencies of itemsets in a dataset with 15 transactions covering 8 items (A,B,C,D,E,F,G,H). By performing all counting manually (without programming), please answer the following questions:

In [1]:
transactions = [
  ['B','D','F','H'],
  ['C','D','F','G'],
  ['A','D','F','G'],
  ['A','B','C','D','H'],
  ['A','C','F','G'],
  ['D','H'],
  ['A','B','E','F'],
  ['A','C','D','F','H'], 
  ['A','C','D','F','G'],
  ['D','F','G','H'],
  ['A','C','D','E'],
  ['B','E','D','F','H'],
  ['D','F','G'],
  ['C','F','G','H'],
  ['A','C','D','F','H']
]

**a. Calculate the support count and the support of patterns {D} and {D,F}**

As a reminder: $$\text{support} = \frac{\text{support count }}{\text{ number of transactions}}$$

**<font color='red'>Support count ({D}) = </font>** 12

**<font color='red'>Support ({D}) = </font>** 12/15 = 4/5 = 0.8

**<font color='red'>Support count ({D,F}) = </font>** 9

**<font color='red'>Support ({D,F}) = </font>** 9/15 = 3/5 = 0.6
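The manual counts above can be cross-checked with a few lines of plain Python. This is only a sketch: the helper names `support_count` and `support` are illustrative and do not come from any library used in this homework.

```python
# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def support(itemset, transactions):
    """Support = support count / number of transactions."""
    return support_count(itemset, transactions) / len(transactions)


for pattern in (['D'], ['D', 'F']):
    print(pattern, support_count(pattern, transactions), support(pattern, transactions))
```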

**b. Report the row indices of transactions which include the pattern {D,F,G}. Start counting from 1, meaning that the first row has index 1.**

**<font color='red'>Answer:</font>** row indices of transactions which include the pattern {D,F,G} :
- 2
- 3
- 9
- 10
- 13
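The row indices can also be checked programmatically. The sketch below assumes 1-based indexing, as the task asks, and uses only built-in Python.

```python
# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]

pattern = {'D', 'F', 'G'}

# enumerate(..., start=1) makes the first row have index 1.
matching_rows = [i for i, t in enumerate(transactions, start=1) if pattern <= set(t)]
print(matching_rows)
```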

**c. Explain what anti-monotonicity of support means, in the example of these patterns {D} and {D,F}**

**<font color='red'>Answer:</font>** Anti-monotonicity means that X ⊆ Y ⇒ support(X) ≥ support(Y). Here {D} ⊆ {D,F}, so support({D}) ≥ support({D,F}); indeed 0.8 ≥ 0.6. The opposite, support({D,F}) > support({D}), is impossible, because every transaction that contains {D,F} also contains {D}.
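As an illustration, anti-monotonicity can be verified exhaustively on the toy dataset: adding one more item to an itemset should never increase its support count. This is a brute-force sketch with an illustrative helper `support_count`; it is only practical here because there are just 8 items.

```python
from itertools import combinations

# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]
items = sorted({item for t in transactions for item in t})


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


# For every itemset X and every extension Y = X + one item,
# record the pair if support_count(Y) > support_count(X).
violations = []
for k in range(1, len(items)):
    for x in combinations(items, k):
        for extra in items:
            if extra in x:
                continue
            y = set(x) | {extra}
            if support_count(y) > support_count(x):
                violations.append((x, y))

print(support_count({'D'}), support_count({'D', 'F'}))
print("violations:", violations)
```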

**d. A set with $n$ elements has in total $2^n$ different subsets (including the empty set). Out of all those subsets, the number of subsets with $k$ elements is: 
\begin{equation*}
{n \choose k} = \frac{n!}{k!(n-k)!}
\end{equation*}
This quantity is also known as the binomial coefficient and $n \choose k$ is read out as "n choose k". In Python, you can calculate $n \choose k$ using the command `comb(n,k)` from the library `scipy.special`. If not already installed, you can install `scipy` with `conda install scipy`. The function `comb` can be used for example like this:**

In [2]:
from scipy.special import comb
print(comb(3,0))

1.0


**How many itemsets could be generated in total from 8 items? Please exclude the empty itemset while counting.**

In [3]:
# Perform the computation here
items = 8
itemsets = 0
for i in range(1,items+1): #Starting from one because no empty
    itemsets = itemsets + comb(items,i)
print(itemsets)

#OR

itemsets_2 = 2**8 - 1 #Minus one because no empty
print(itemsets_2)

255.0
255


**<font color='red'>Answer:</font>** 255 itemsets could be generated in total from 8 items

**e. How many different 2-element itemsets can be generated from the items A,B,C,D,E,F,G,H?**

In [4]:
# Perform the computation here
number_items = 8
two_element_itemsets = 0
# Order does not matter, so the i-th item is paired only with the items after it:
for i in range(1, number_items):
    two_element_itemsets = two_element_itemsets + number_items - i
print(two_element_itemsets)

28


**<font color='red'>Answer:</font>** We can have 28 2-element itemsets: 'A' can be paired with each of the 7 other letters, 'B' with the 6 letters after it (the pair with 'A' is already counted), and so on. In total, 7+6+5+4+3+2+1 = 28, which equals ${8 \choose 2}$.
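A quick cross-check of the number of 2-element itemsets: since an itemset has no order and no repeated items, the count is the binomial coefficient "8 choose 2". The sketch below enumerates the pairs with `itertools.combinations` and compares with `math.comb` (available from Python 3.8).

```python
from itertools import combinations
from math import comb  # Python 3.8+

items = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']

# Every unordered pair of distinct items, e.g. ('A', 'B'), ('A', 'C'), ...
pairs = list(combinations(items, 2))

print(len(pairs), comb(len(items), 2))
```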

### Apriori algorithm

In the following subtasks we will use the tools for frequent itemset mining as implemented in the `mlxtend` library. For example, we will use the function <code>apriori()</code> from ```mlxtend.frequent_patterns``` to find the frequent itemsets in the toy dataset above. 

Use <code>pip3 install mlxtend</code> or <code>pip install mlxtend</code> if needed. The documentation is available __[here](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/)__.<br> 

<details>
<summary>Issues with installing</summary> 
    It might occur because the python environment variable is set to another python version than the one used for Jupyter Notebook. If Anaconda is installed, there should be also Anaconda Prompt available (similar to cmd), where the pip command can be used. If there are still problems let us know in Piazza.
</details>

First, we will rearrange the given dataset of transactions into a pandas DataFrame with boolean (true/false) values to denote which items belong to the transactions.

In [5]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,False,True,False,True,False,True,False,True
1,False,False,True,True,False,True,True,False
2,True,False,False,True,False,True,True,False
3,True,True,True,True,False,False,False,True
4,True,False,True,False,False,True,True,False
5,False,False,False,True,False,False,False,True
6,True,True,False,False,True,True,False,False
7,True,False,True,True,False,True,False,True
8,True,False,True,True,False,True,True,False
9,False,False,False,True,False,True,True,True
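To make the encoding step less opaque, here is a rough pure-Python equivalent of what the boolean table represents. It is a sketch of the idea, not of how `TransactionEncoder` is implemented: one column per item in sorted order, and one row of true/false values per transaction.

```python
# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]

# Column names: every distinct item, in sorted order.
items = sorted({item for t in transactions for item in t})

# One boolean row per transaction: True where the item occurs in it.
encoded = [[item in t for item in items] for t in transactions]

print(items)
print(encoded[0])
```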


**f. Run the code below to find frequent itemsets with threshold <code>min_support=5/len(df)</code> for support. This means that it returns all non-empty itemsets with support count at least 5.**

In [6]:
from mlxtend.frequent_patterns import apriori

freq_itemsets = apriori(df, min_support=5/len(df), use_colnames=True)  #use_colnames shows the names of items
freq_itemsets

Unnamed: 0,support,itemsets
0,0.533333,(A)
1,0.533333,(C)
2,0.8,(D)
3,0.8,(F)
4,0.466667,(G)
5,0.533333,(H)
6,0.4,"(A, C)"
7,0.4,"(A, D)"
8,0.4,"(A, F)"
9,0.4,"(D, C)"


**By looking at the output of the previous code, please report all frequent 2-sets.**

**<font color='red'>Answer:</font>** The frequent 2-sets are those with a support count of at least 5 (support ≥ 5/15). According to the output of the code above, they are:
- (A,C)
- (A,D)
- (A,F)
- (C,D)
- (C,F)
- (D,F)
- (D,G)
- (D,H)
- (F,G)
- (F,H)
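The list of frequent 2-sets can be reproduced without `mlxtend` by counting every pair directly. This is a brute-force sketch (the helper `support_count` is illustrative), using the same threshold of support count 5.

```python
from itertools import combinations

# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]
items = sorted({item for t in transactions for item in t})
min_count = 5  # minimum support count


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


frequent_2 = [pair for pair in combinations(items, 2) if support_count(pair) >= min_count]

for pair in frequent_2:
    print(pair, support_count(pair))
```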

**g. Please see the lecture slides regarding how the apriori algorithm works. Instead of reading out the frequent 3-sets from the above results, we will find these manually using the apriori algorithm. For this, create all possible candidate 3-sets by adding an item to the frequent 2-sets in all possible ways and keeping as candidates only those 3-sets for which all subsets of size 2 are frequent (that is, discarding all the 3-sets for which some subset of size 2 is not frequent). Report the remaining candidate 3-sets.**  (For many of you, it is probably easier to solve this task manually, without programming).

**<font color='red'>Answer:</font>** By following the apriori algorithm, the remaining candidate 3-sets are:
- (A,C,D)
- (A,C,F)
- (A,D,F)
- (C,D,F)
- (D,F,G)
- (D,F,H)

A 3-set such as (C,D,G) is not a candidate, because its subset (C,G) has support count 4 and is therefore not frequent.
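The candidate-generation step can be sketched in a few lines: build every 3-set from the items that occur in frequent 2-sets, then keep only those whose three 2-element subsets are all frequent. The frequent 2-sets are hard-coded from the answer to subtask (f); the variable names are illustrative.

```python
from itertools import combinations

# Frequent 2-sets from subtask (f), written as two-letter strings for brevity.
frequent_2 = {frozenset(p) for p in
              ['AC', 'AD', 'AF', 'CD', 'CF', 'DF', 'DG', 'DH', 'FG', 'FH']}

# Items that appear in at least one frequent 2-set.
items = sorted(set().union(*frequent_2))

# Prune step: a 3-set stays a candidate only if all of its 2-subsets are frequent.
candidates_3 = [
    c for c in combinations(items, 3)
    if all(frozenset(pair) in frequent_2 for pair in combinations(c, 2))
]

print(candidates_3)
```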

**h. Explain why the number of candidate 3-sets is greater than the number of frequent 3-sets as reported by <code>apriori(df, min_support=5/len(df), use_colnames=True)</code>.**

**<font color='red'>Answer:</font>** There are more candidates than frequent 3-sets because a candidate is only guaranteed to have frequent 2-element subsets; the function reports a candidate only if its own support is at least min_support=5/len(df). For example, the support count of {A,C,D} is 5, so it is reported, but the support count of {A,C,F} is 4 < 5, so it is not.
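To see which candidates survive, one can count the support of each candidate 3-set whose 2-element subsets are all frequent. The sketch below hard-codes those candidates as three-letter strings and uses an illustrative helper `support_count`.

```python
# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]
min_count = 5  # minimum support count


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


# Candidate 3-sets whose 2-element subsets are all frequent.
candidates_3 = ['ACD', 'ACF', 'ADF', 'CDF', 'DFG', 'DFH']

counts = {c: support_count(c) for c in candidates_3}
frequent_3 = [c for c, n in counts.items() if n >= min_count]

print(counts)
print(frequent_3)
```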

**i. Do you think that on this dataset the apriori algorithm would be faster than the brute force approach which counts the frequencies of all possible itemsets? (No need for writing code here, please just express your opinion and explain why you think so) Would your opinion be different if the dataset would be bigger or smaller?**

**<font color='red'>Answer:</font>** I think the apriori algorithm becomes really helpful as the number of items grows. With 8 items there are only 2^8 - 1 = 255 possible itemsets, so brute force is still feasible here (even counting by hand is possible, although it is harder and mistakes happen), and apriori gains little. With just 20 items there would already be 2^20 - 1 = 1,048,575 possible itemsets, so counting all of them is no longer practical and the pruning done by apriori is needed. For a smaller dataset the difference would matter even less.
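One rough way to compare the two approaches is to count how many itemsets each one has to evaluate. The sketch below is a simplified level-wise apriori written for illustration (not the `mlxtend` implementation); it counts candidates only and says nothing about actual running time, which depends on the implementation.

```python
from itertools import combinations

# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]
items = sorted({item for t in transactions for item in t})
min_count = 5  # minimum support count


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= set(t))


# Brute force evaluates every non-empty subset of the items.
brute_force_checked = 2 ** len(items) - 1

# Level-wise apriori: evaluate candidates of size k, keep the frequent ones,
# and build size-(k+1) candidates only from them.
apriori_checked = 0
candidates = [frozenset([i]) for i in items]
k = 1
while candidates:
    apriori_checked += len(candidates)
    frequent = {c for c in candidates if support_count(c) >= min_count}
    k += 1
    # Join: unions of two frequent (k-1)-sets that have exactly k items.
    joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune: keep a k-set only if all of its (k-1)-subsets are frequent.
    candidates = [c for c in joined
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))]

print("brute force:", brute_force_checked, "apriori:", apriori_checked)
```

On this dataset the sketch evaluates 8 single items, 15 pairs and 6 triples, far fewer than the 255 subsets a brute-force count would visit.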

# 2.  Association rules (1 point)

The goal in this exercise is to study the association rules in the same dataset as in Task 1. Look at the output of <code>freq_itemsets = apriori(df, min_support=5/len(df), use_colnames=True)</code> and manually (without programming) answer the following questions:

In [7]:
freq_itemsets = apriori(df, min_support=5/len(df), use_colnames=True)
freq_itemsets

Unnamed: 0,support,itemsets
0,0.533333,(A)
1,0.533333,(C)
2,0.8,(D)
3,0.8,(F)
4,0.466667,(G)
5,0.533333,(H)
6,0.4,"(A, C)"
7,0.4,"(A, D)"
8,0.4,"(A, F)"
9,0.4,"(D, C)"


**a. Report all possible association rules where the union of the antecedent (left-hand-side) and the consequent (right-hand-side) is equal to the set {D,F,G}. Please do not report the cases where either the antecedent or the consequent is empty.**

**<font color='red'>Answer:</font>** All possible association rules where the union of the antecedent (left-hand-side) and the consequent (right-hand-side) is equal to the set {D,F,G} are :
- (D) -> (F,G)
- (F) -> (G,D)
- (G) -> (D,F)
- (D,F) -> (G)
- (D,G) -> (F)
- (F,G) -> (D)
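The six rules can be enumerated mechanically: every non-empty proper subset of {D,F,G} is an antecedent, and the remaining items form the consequent, which gives 2^3 - 2 = 6 rules. A minimal sketch:

```python
from itertools import combinations

itemset = ('D', 'F', 'G')

rules = []
# Antecedent sizes 1 .. len(itemset)-1, so neither side is ever empty.
for size in range(1, len(itemset)):
    for antecedent in combinations(itemset, size):
        consequent = tuple(i for i in itemset if i not in antecedent)
        rules.append((antecedent, consequent))

for antecedent, consequent in rules:
    print(antecedent, "->", consequent)
```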

**b. From the previous subtask (2a) find the association rule that has two items as the antecedent and {F} as the consequent. For that association rule, calculate the confidence and lift (for most of you it is probably easiest to do it manually)**

\begin{equation*}
\text{lift}(X\rightarrow Y) = \frac{\text{confidence}(X\rightarrow Y)}{\text{confidence}(\{\}\rightarrow Y)}
\end{equation*}

In [8]:
# please perform your calculations here
# Rule: (D,G) -> (F)

# Confidence((D,G) -> (F)) = support((D,F,G)) / support((D,G)) = (1/3) / (1/3) = 1

# Lift((D,G) -> (F)) = confidence((D,G) -> (F)) / confidence({} -> (F)) = support((D,F,G)) / (support((D,G)) * support((F)))
# So, Lift((D,G) -> (F)) = (1/3) / ((1/3) * (4/5)) = 5/4


**<font color='red'>Answer: Confidence = </font>** 1

**<font color='red'>Answer: Lift = </font>** 5/4 = 1.25. Since lift > 1, (D,G) and (F) are positively correlated.
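The same confidence and lift can be recomputed from the raw transactions. In this sketch, `support` is an illustrative helper, and confidence({} -> Y) is taken to be support(Y), as in the lift formula above.

```python
# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


antecedent, consequent = {'D', 'G'}, {'F'}

# confidence(X -> Y) = support(X ∪ Y) / support(X)
confidence = support(antecedent | consequent) / support(antecedent)
# lift(X -> Y) = confidence(X -> Y) / support(Y)
lift = confidence / support(consequent)

print(confidence, lift)
```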

**Next, we will find all rules on this dataset with minimum support count 5 and confidence at least 0.5. For this we will use two methods using different libraries and analyse their differences. The first method is to use the function <code>association_rules()</code> from the mlxtend.frequent_patterns package. __[Documentation](https://www.pydoc.io/pypi/mlxtend-0.7.0/autoapi/frequent_patterns/association_rules/index.html)__.**

In [9]:
from mlxtend.frequent_patterns import association_rules

freq_itemsets = apriori(df, min_support=5/len(df), use_colnames=True)
association_rules(freq_itemsets,metric='confidence',min_threshold=0.5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(A),(C),0.533333,0.533333,0.4,0.75,1.40625,0.115556,1.866667
1,(C),(A),0.533333,0.533333,0.4,0.75,1.40625,0.115556,1.866667
2,(A),(D),0.533333,0.8,0.4,0.75,0.9375,-0.026667,0.8
3,(D),(A),0.8,0.533333,0.4,0.5,0.9375,-0.026667,0.933333
4,(A),(F),0.533333,0.8,0.4,0.75,0.9375,-0.026667,0.8
5,(F),(A),0.8,0.533333,0.4,0.5,0.9375,-0.026667,0.933333
6,(D),(C),0.8,0.533333,0.4,0.5,0.9375,-0.026667,0.933333
7,(C),(D),0.533333,0.8,0.4,0.75,0.9375,-0.026667,0.8
8,(F),(C),0.8,0.533333,0.4,0.5,0.9375,-0.026667,0.933333
9,(C),(F),0.533333,0.8,0.4,0.75,0.9375,-0.026667,0.8


**The second method is to use the `apriori()` method from the `apyori` package. This allows us to additionally set minimum thresholds for the values of confidence and lift. Use <code>pip3 install apyori</code> or <code>pip install apyori</code> to install the apyori package. __[Unofficial documentation](http://zaxrosenberg.com/unofficial-apyori-documentation/).__ This method returns a generator that yields RelationRecord instances. Here we provide code that you can use to transform these RelationRecord instances into a single dataframe in a format similar to that of the method above.**

In [10]:
import apyori

def rules_to_df(rules):
    results = list(rules)
    apriori_df = pd.DataFrame(columns=('Items','Antecedent','Consequent','Support','Confidence','Lift'))

    Support =[]
    Confidence = []
    Lift = []
    Items = []
    Antecedent = []
    Consequent=[]

    for RelationRecord in results:
        for ordered_stat in RelationRecord.ordered_statistics:
            Support.append(RelationRecord.support)
            Items.append(RelationRecord.items)
            Antecedent.append(ordered_stat.items_base)
            Consequent.append(ordered_stat.items_add)
            Confidence.append(ordered_stat.confidence)
            Lift.append(ordered_stat.lift)

    apriori_df['Items'] = list(map(set, Items))                                   
    apriori_df['Antecedent'] = list(map(set, Antecedent))
    apriori_df['Consequent'] = list(map(set, Consequent))
    apriori_df['Support'] = Support
    apriori_df['Confidence'] = Confidence
    apriori_df['Lift']= Lift
    return apriori_df

rules = apyori.apriori(transactions, min_support = 5/len(transactions), min_confidence = 0.5)
ap_df = rules_to_df(rules)
ap_df

Unnamed: 0,Items,Antecedent,Consequent,Support,Confidence,Lift
0,{A},{},{A},0.533333,0.533333,1.0
1,{C},{},{C},0.533333,0.533333,1.0
2,{D},{},{D},0.8,0.8,1.0
3,{F},{},{F},0.8,0.8,1.0
4,{H},{},{H},0.533333,0.533333,1.0
5,"{A, C}",{A},{C},0.4,0.75,1.40625
6,"{A, C}",{C},{A},0.4,0.75,1.40625
7,"{A, D}",{A},{D},0.4,0.75,0.9375
8,"{A, D}",{D},{A},0.4,0.5,0.9375
9,"{A, F}",{A},{F},0.4,0.75,0.9375


**c. Compare the results of these two methods. List all rules that are reported by `mlxtend` and not by `apyori`.**

**<font color='red'>Answer:</font>** mlxtend rules :


**d. List all rules that are reported by `apyori` and not by `mlxtend`.**

**<font color='red'>Answer:</font>** apyori rules not reported by mlxtend:
- {}->{A}
- {}->{C}
- {}->{D}
- {}->{F}
- {}->{H}
- {}->{F, D}
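The rules with an empty antecedent can be reproduced by hand. The sketch below assumes, as the `apyori` output above suggests, that the empty set is treated as an antecedent with support 1, so that confidence({} -> X) = support(X); such a rule then passes both thresholds exactly when support(X) ≥ 5/15 and support(X) ≥ 0.5. The helper `support` is illustrative.

```python
from itertools import combinations

# Toy dataset from this task (15 transactions over the items A..H).
transactions = [
    ['B', 'D', 'F', 'H'], ['C', 'D', 'F', 'G'], ['A', 'D', 'F', 'G'],
    ['A', 'B', 'C', 'D', 'H'], ['A', 'C', 'F', 'G'], ['D', 'H'],
    ['A', 'B', 'E', 'F'], ['A', 'C', 'D', 'F', 'H'], ['A', 'C', 'D', 'F', 'G'],
    ['D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E'], ['B', 'E', 'D', 'F', 'H'],
    ['D', 'F', 'G'], ['C', 'F', 'G', 'H'], ['A', 'C', 'D', 'F', 'H'],
]
items = sorted({item for t in transactions for item in t})
min_support = 5 / len(transactions)
min_confidence = 0.5


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


# Consequents X of the rules {} -> X that pass both thresholds.
empty_antecedent_rules = []
for k in range(1, len(items) + 1):
    for itemset in combinations(items, k):
        s = support(itemset)  # also the confidence of {} -> itemset (assumption above)
        if s >= min_support and s >= min_confidence:
            empty_antecedent_rules.append(set(itemset))

print(empty_antecedent_rules)
```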


**e. Are there any other differences in the results? Explain why these differences have occurred (look at what the differences were, think why these might have happened, you might also want to consult the documentation of these libraries following the links above).**

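**<font color='red'>Answer:</font>** Maybe I misunderstood the rules, but I think the two dataframes are otherwise the same, except for the rules with an empty antecedent {} that I listed above. These appear because apyori also allows an empty antecedent, in which case the confidence equals the support of the consequent and the lift is 1.0 (see rows 0-4 of its output). For the other rules, the support, confidence and lift values are always the same; I may still be missing something.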

# 3. Titanic survival (1 point)

Here we analyse the famous Titanic survival dataset which is also included in the zip container. The Titanic survival dataset is also available __[here as CSV](https://courses.cs.ut.ee/2017/DM/fall/uploads/Main/titanic.csv)__. Read in the file and explore the data using the following code.

In [11]:
import pandas as pd

titanic_df = pd.read_csv("titanic.csv")
titanic = titanic_df.values # Convert dataFrame into Array
print(titanic_df.head(5))
print('Class')
print(titanic_df.Class.value_counts())
print('Sex')
print(titanic_df.Sex.value_counts())
print('Age')
print(titanic_df.Age.value_counts())
print('Survived')
print(titanic_df.Survived.value_counts())

  Class   Sex    Age Survived
1   3rd  Male  Child       No
2   3rd  Male  Child       No
3   3rd  Male  Child       No
4   3rd  Male  Child       No
5   3rd  Male  Child       No
Class
Crew    885
3rd     706
1st     325
2nd     285
Name: Class, dtype: int64
Sex
Male      1731
Female     470
Name: Sex, dtype: int64
Age
Adult    2092
Child     109
Name: Age, dtype: int64
Survived
No     1490
Yes     711
Name: Survived, dtype: int64


Next, study the output of the following method on this dataset:

In [12]:
rules = apyori.apriori(titanic, min_support = 0.1, min_confidence = 0.8)
rules_df = rules_to_df(rules)
print(rules_df)

                      Items          Antecedent     Consequent   Support  \
0                   {Adult}                  {}        {Adult}  0.950477   
1              {1st, Adult}               {1st}        {Adult}  0.144934   
2              {Adult, 2nd}               {2nd}        {Adult}  0.118582   
3              {Adult, 3rd}               {3rd}        {Adult}  0.284871   
4             {Adult, Crew}              {Crew}        {Adult}  0.402090   
5           {Female, Adult}            {Female}        {Adult}  0.193094   
6             {Adult, Male}              {Male}        {Adult}  0.757383   
7               {No, Adult}                {No}        {Adult}  0.653339   
8              {Adult, Yes}               {Yes}        {Adult}  0.297138   
9              {Crew, Male}              {Crew}         {Male}  0.391640   
10               {No, Male}                {No}         {Male}  0.619718   
11       {Adult, Male, 3rd}         {Male, 3rd}        {Adult}  0.209905   
12         {

Note that the items have been denoted by the values of features and you need to guess yourself what feature the value is about. This is possible to do, since different features have different values in this dataset. For example, the item `1st` means that the instance has `Class=1st`, and `Adult` means `Age=Adult`, and `No` means `Survived=No`. Please now answer the following questions:

**a. How many rules did the apriori algorithm find?**

In [13]:
# You can use code if you like to

**<font color='red'>Number of rules:</font>** 30

**b. Consider all rules with confidence equal to 1.0 (set the min_confidence). Which of these is the most interesting? One rule among them can explain all others, which one? (since this particular rule has confidence 1.0, all other rules considered here have also confidence 1.0). How would you explain these rules?** Hint: some of you might find it useful to use some code to filter `rules_df`, others might find it easier to answer the questions by manually inspecting `rules_df`.

In [14]:
# You can use code if you like to

**<font color='red'>Answer:</font>** The most interesting rule is {Crew} -> {Adult}, and it can explain all the others. Its confidence is 1.0 because every crew member was an adult: the support of {Adult, Crew} (0.402090) equals the support of {Crew} (885/2201). The other rules with confidence 1.0, up to the one on the 4 items {No, Adult, Crew, Male}, only add items next to Crew in the antecedent. Adding conditions to an antecedent that already guarantees Adult cannot lower the confidence, so it is logical that these rules also have confidence 1.0.
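The confidence of the single-antecedent rules {class} -> {Adult} can be recomputed from numbers already printed above: the rule supports from the `rules_df` output and the class counts from `value_counts()`. This is a sketch using rounded values copied from those outputs, so the results are approximate.

```python
# Total number of passengers (1490 "No" + 711 "Yes").
n = 2201

# Class sizes from titanic_df.Class.value_counts().
class_counts = {'Crew': 885, '3rd': 706, '1st': 325, '2nd': 285}

# Support of {class, Adult}, copied (rounded) from the rules_df output.
support_with_adult = {'1st': 0.144934, '2nd': 0.118582, '3rd': 0.284871, 'Crew': 0.402090}

# confidence({class} -> {Adult}) = support({class, Adult}) / support({class})
confidence = {
    c: support_with_adult[c] / (class_counts[c] / n)
    for c in class_counts
}

for c, value in confidence.items():
    print(f"{{{c}}} -> {{Adult}}: confidence ~ {value:.4f}")
```

Only the Crew rule comes out at (approximately) 1.0; the other classes included some children.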

**c. Consider the two rules with the highest lift value (find them manually or sort the rules by lift). These two rules have the same lift, the same support, and the same confidence. Why?** <br> Hint: the reason is related to what you discovered in (3b). <br> Note: you should check rules with min_confidence of 0.8, not with min_confidence of 1 anymore.

In [15]:
# You can use code if you like to

**<font color='red'>Answer:</font>** The two rules with the highest lift are:
- 27  {No, Adult, Crew, Male} ; {No, Crew} ; {Adult, Male} ; Lift = 1.31
- 14  {Adult, Crew, Male} ; {Crew} ; {Adult, Male} ; Lift = 1.28

These two rules do not have the same lift, support, and confidence, so I may have made a mistake somewhere.

However, two rules do share the same lift, support, and confidence: line 23 ({No, Crew, Male}) and line 28 ({No, Crew, Male, Adult}). I think this is because these items are linked with a confidence of 1: every crew member is an adult, so adding Adult to {No, Crew, Male} does not change which passengers are counted.

**d. What is the most interesting rule in these results, other than the ones discussed in (3b) and (3c)? Please explain why you find it the most interesting.**

**<font color='red'>Answer:</font>** I think the rules on the 4 items {Adult, No, Male, 3rd} show which passengers were most likely to die, and they confirm what we have heard:
- Adults died more often than children, because children were saved first
- Men died more often than women, for the same reason
- 3rd class passengers died more often than those in other classes, because they were poor

So patterns found in data can be really relevant.

# 4. Titanic continued (1 point)

Consider the same Titanic dataset as in task 3.

**a. Please run the apriori algorithm again, but this time with very low min support and min confidence. Sort the results by lift and report 3 rules with the highest lift.** Hint: DataFrames have built-in methods for sorting.

In [16]:
rules = apyori.apriori(titanic, min_support = 0.000001, min_confidence = 0.000001)
rules_df = rules_to_df(rules)


rules_df = rules_df.sort_values(by='Lift',ascending=True)
rules_df.tail(6)

# The 3 highest lift values (each is shared by a rule and its reverse):
# 1st - Lift = 8.88 => {Child} -> {2nd, Male, Yes}
# 2nd - Lift = 5.39 => {Female, Child} -> {2nd, Yes}
# 3rd - Lift = 5.00 => {1st, Child} -> {Male, Yes}

Unnamed: 0,Items,Antecedent,Consequent,Support,Confidence,Lift
550,"{1st, Male, Child, Yes}","{Male, Yes}","{1st, Child}",0.002272,0.013624,4.997729
545,"{1st, Male, Child, Yes}","{1st, Child}","{Male, Yes}",0.002272,0.833333,4.997729
622,"{Female, 2nd, Child, Yes}","{2nd, Yes}","{Female, Child}",0.005906,0.110169,5.388512
623,"{Female, 2nd, Child, Yes}","{Female, Child}","{2nd, Yes}",0.005906,0.288889,5.388512
643,"{2nd, Child, Male, Yes}","{2nd, Male, Yes}",{Child},0.004998,0.44,8.884771
632,"{2nd, Child, Male, Yes}",{Child},"{2nd, Male, Yes}",0.004998,0.100917,8.884771
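The table shows that a rule and its reverse share the same lift. This follows from the definition, since lift(X -> Y) = support(X ∪ Y) / (support(X) · support(Y)) is symmetric in X and Y. The sketch below rechecks it for rows 643 and 632 using the rounded values printed above and the count of 109 children out of 2201 passengers, so small rounding differences are expected.

```python
n = 2201                      # total number of passengers
support_child = 109 / n       # from titanic_df.Age.value_counts()

# Values copied (rounded) from rows 643 and 632 of the table above.
support_rule = 0.004998       # support of {2nd, Child, Male, Yes}
confidence_643 = 0.44         # {2nd, Male, Yes} -> {Child}
confidence_632 = 0.100917     # {Child} -> {2nd, Male, Yes}

# lift(X -> Y) = confidence(X -> Y) / support(Y)
lift_643 = confidence_643 / support_child

# support({2nd, Male, Yes}) recovered from row 643: support / confidence
support_2nd_male_yes = support_rule / confidence_643
lift_632 = confidence_632 / support_2nd_male_yes

print(round(lift_643, 3), round(lift_632, 3))
```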


**b. Discuss what you can learn from the 3 rules with the highest lift.**

**<font color='red'>Answer:</font>** From these 3 rules we can learn that:
- Children from the 2nd class were saved, regardless of gender
- More 2nd class children were saved than 1st class children

**c. Sort all rules by confidence. What can you learn from the 9 rules with confidence 1.0 and lift greater than 3?**

In [17]:
# TODO 
rules_df = rules_df.sort_values(by='Confidence',ascending=False)
rules_df[(rules_df["Lift"] > 3)].head(9)


Unnamed: 0,Items,Antecedent,Consequent,Support,Confidence,Lift
641,"{2nd, Child, Male, Yes}","{2nd, Child, Male}",{Yes},0.004998,1.0,3.09564
161,"{1st, Yes, Child}","{1st, Child}",{Yes},0.002726,1.0,3.09564
551,"{1st, Male, Child, Yes}","{1st, Male, Child}",{Yes},0.002272,1.0,3.09564
536,"{Female, 1st, Yes, Child}","{Female, 1st, Child}",{Yes},0.000454,1.0,3.09564
238,"{2nd, Child, Yes}","{2nd, Child}",{Yes},0.010904,1.0,3.09564
317,"{No, Child, 3rd}","{No, Child}",{3rd},0.023626,1.0,3.117564
719,"{Female, Child, No, 3rd}","{Female, Child, No}",{3rd},0.007724,1.0,3.117564
626,"{Female, 2nd, Child, Yes}","{Female, 2nd, Child}",{Yes},0.005906,1.0,3.09564
749,"{No, Male, Child, 3rd}","{No, Male, Child}",{3rd},0.015902,1.0,3.117564


**<font color='red'>Answer:</font>** The 9 rules are :
- {Child, Yes, 2nd, Male} 	
- {Child, Yes, 1st} 	
- {Child, Yes, Male, 1st} 	
- {Child, Yes, Female, 1st}
- {Child, Yes, 2nd} 	
- {Child, No, 3rd} 	
- {Child, Female, No, 3rd}
- {Child, Yes, 2nd, Female}
- {Child, No, Male, 3rd}

We can learn that:
- All children from the 1st and 2nd class survived (confidence 1.0 for {1st, Child} -> {Yes} and {2nd, Child} -> {Yes})
- All children who did not survive were from the 3rd class (confidence 1.0 for {No, Child} -> {3rd})
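The two lift values in the table, 3.09564 and 3.117564, are not a coincidence. When the confidence is 1.0, lift = 1 / support(consequent), so every such rule with consequent {Yes} has lift 2201/711 and every such rule with consequent {3rd} has lift 2201/706. A short check, using the counts printed by `value_counts()` earlier:

```python
n = 2201             # total number of passengers
survived_yes = 711   # from titanic_df.Survived.value_counts()
third_class = 706    # from titanic_df.Class.value_counts()

confidence = 1.0

# lift(X -> Y) = confidence(X -> Y) / support(Y)
lift_yes = confidence / (survived_yes / n)
lift_3rd = confidence / (third_class / n)

print(round(lift_yes, 6), round(lift_3rd, 6))
```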

**d. Sort all rules by support. What can you learn from the 4 rules with support greater than 0.7?**

In [18]:
# TODO
rules_df = rules_df.sort_values(by='Support',ascending=False)
rules_df.head(7)

Unnamed: 0,Items,Antecedent,Consequent,Support,Confidence,Lift
3,{Adult},{},{Adult},0.950477,0.950477,1.0
7,{Male},{},{Male},0.786461,0.786461,1.0
70,"{Adult, Male}",{},"{Adult, Male}",0.757383,0.757383,1.0
71,"{Adult, Male}",{Adult},{Male},0.757383,0.796845,1.013204
72,"{Adult, Male}",{Male},{Adult},0.757383,0.963027,1.013204
8,{No},{},{No},0.676965,0.676965,1.0
75,"{No, Adult}",{No},{Adult},0.653339,0.965101,1.015386


**<font color='red'>Answer:</font>** The 4 rules with support greater than 0.7 are :
- {}->Adult
- {}->Male
- {}->{Adult,Male}
- {Adult}->{Male}

We can learn that:
- A person on the ship was very likely to be an adult
- A person on the ship was very likely to be male
- A person on the ship was very likely to be both an adult and male
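High support alone does not make a rule informative. As a sketch, the lift of {Adult} -> {Male} can be recomputed from the counts printed by `value_counts()` and the rounded support from the table above; a value close to 1 indicates that being an adult and being male are nearly independent.

```python
n = 2201                        # total number of passengers
support_adult = 2092 / n        # from titanic_df.Age.value_counts()
support_male = 1731 / n         # from titanic_df.Sex.value_counts()
support_adult_male = 0.757383   # copied (rounded) from the table above

# lift(X -> Y) = support(X ∪ Y) / (support(X) * support(Y))
lift = support_adult_male / (support_adult * support_male)

print(round(support_adult, 6), round(support_male, 6), round(lift, 6))
```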

# B1 (OPTIONAL BONUS TASK).  The real Titanic dataset (1 bonus point)

This task is a bonus task, meaning that you can earn bonus points that will be added to your homework, project and exam points. For example, if you get 89 points in total from homework, project and exam, but you have earned 2 bonus points from your homeworks, then you get the grade `A` because your final score will be 89+2=91.

In the current bonus task you will earn 1 bonus point for doing all that you have been asked to do. **If you do not solve all subtasks fully as instructed, then you do not get any bonus points at all (no points will be given for partial answers)**.

So far you have been dealing in tasks 3 and 4 with a simplified version of the Titanic dataset. For this bonus task, you are required to download the full dataset at the __[Titanic kaggle competition page](https://www.kaggle.com/c/titanic/overview)__. You will have to register to kaggle, then navigate to "data" tab inside the competition page and download the "train.csv" file, either with kaggle's API command or through the webpage itself. This is the only part of the dataset you will need.

Now, do some __[feature engineering](https://en.wikipedia.org/wiki/Feature_engineering)__ to the dataset by adding the "family size, age class and fare per person" columns to it as described __[here](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/)__. Finally, print the modified dataset and calculate and answer what is the support of family size = 1.

In [19]:
# TODO

**<font color='red'>Answer:</font>** Answer goes here.

## <font color='red'>This was the last task! Please restart the kernel and run all before submission! (`Kernel -> Restart and Run All`)</font>

## How long did it take you to solve the homework?

Please answer as precisely as you can. It does not affect your points or grade in any way. It is okay if it took 0.5 hours or 24 hours. Please count in astronomical hours (1 hour = 60 minutes) and not academic hours (1 hour = 45 minutes). The collected information will be used to improve future homeworks.

**<font color='red'>Task 1 (please change X in the next cell into your estimate)</font>**

1 hour

**<font color='red'>Task 2 (please change X in the next cell into your estimate)</font>**

2 hours

**<font color='red'>Task 3 (please change X in the next cell into your estimate)</font>**

2 hours

**<font color='red'>Task 4 (please change X in the next cell into your estimate)</font>**

0.5 hours

**<font color='red'>Task B1 (please change X in the next cell into your estimate)</font>**

X hours

**<font color='red'>TOTAL (please change X in the next cell into your estimate)</font>**

5.5 hours

**<font color='red'>THANK YOU FOR YOUR EFFORT!</font>**