# Data Mining in Titanic Data
1. [Introduction](#introduction)
1. [Association Rule Mining](#association-rule-mining)
1. [Association Rule Mining on Titanic Data](#association-rule-mining-on-titanic-data)
    - [Ready Up](#ready-up)
    - [Data Visualization with Plots](#data-visualization-with-plots)
    - [Analysis - Methodology](#analysis---methodology)
    - [Gender Analysis](#gender-analysis)
    - [Gender Result](#gender-result)
    - [Title Analysis](#title-analysis)
    - [Title Result](#title-result)
1. [Algorithm Evaluation](#algorithm-evaluation)        
1. [References](#references)


## Introduction
---

In the real world, we deal with many types of data, for example <mark>date</mark>, <mark>currency</mark>, <mark>stock rate</mark>, 
<mark>categories</mark> and <mark>rank</mark>. These are not all the same data type, and it is not easy to relate them in a single 
piece of information. **Data Mining** offers many methods to extract associations or information from such complex data. Some of these methods are:

- Classification 
- Estimation 
- Prediction 
- Affinity Grouping or Association Rules 
- Clustering 
- Anomaly Detection

In this post, I explain the data mining process on a **nominal data set**.  
The technique used to extract interesting information from nominal (categorical) data
is **Association Rule Mining**.

## Association Rule Mining

### Algorithms:
---

- Apriori
- FP Growth
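
As a rough illustration of how Apriori finds frequent itemsets, the sketch below grows itemsets level by level and prunes any candidate that has an infrequent subset. It is a simplified teaching version and not the `mlxtend` implementation used later in this post; the function name `apriori_sketch` and the toy transactions are hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in sets) / n

    # Level 1: frequent single items.
    items = sorted({item for t in sets for item in t})
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Toy transactions (hypothetical, not the Titanic data).
toy = [
    ["female", "1", "Survived"],
    ["female", "1", "Survived"],
    ["male", "3", "Dead"],
    ["male", "3", "Dead"],
    ["male", "1", "Survived"],
]
frequent_itemsets = apriori_sketch(toy, min_support=0.4)
```

FP Growth aims at the same set of frequent itemsets but avoids generating candidates explicitly by compressing the transactions into a prefix tree.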

---

### Parameters:
---

1. **Support**
    - Ratio of the number of observations containing a particular item (or itemset) to the total number of observations.
    - In other words, the share of all observations in which the item appears.
    - Range \[0 - 1]
 
    $$  
        Support(B) = {Observations containing (B) \over Total Observations }
    $$
    
1. **Confidence**
    - How often B appears in the observations that contain A, i.e. how confident we can be that A implies B.
    - Range \[0 - 1] 

    $$
        Confidence(A→B) = {Observations containing both (A and B) \over Observations containing (A)}
    $$
    
1. **Lift**
    - How much more likely A and B are to occur together than would be expected if they occurred independently.
    - Range \[0 - inf]
    - If <mark>lift > 1</mark>, it is an **interesting scenario** to consider. A lift of 1 means A and B are independent.

    $$
    Lift(A→B) = {Confidence (A→B) \over Support (B)}
    $$
        
1. **Leverage**
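    - Difference between the observed support of A and B together and the support expected if A and B were independent.
    - Range  \[-1, 1]
    - If <mark>leverage = 0</mark>, both are independent.
    
    $$
    L (A → B) = S (A→B) - S (A) \times S (B)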
    $$
        
1. **Conviction**
    - Measures how strongly the **consequent** depends on the **premise** (antecedent).
    - Range \[0 - inf]
    - If <mark>conviction = 1</mark>, items are independent.
    - High confidence combined with low support of the consequent gives a high conviction, meaning the consequent mostly **depends** on the premise.
    
    $$
    C (A -> B) = {1 - S (B) \over 1 - Confidence (A → B)}
    $$   
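
To make the five parameters concrete, here is a minimal pure-Python sketch that computes them for one rule on a tiny, made-up transaction list. The helper names `support` and `rule_metrics` and the toy data are assumptions for illustration; leverage is computed as the difference between the joint support and the product of the individual supports.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Compute the metrics of the rule antecedent -> consequent."""
    s_a = support(transactions, antecedent)
    s_b = support(transactions, consequent)
    s_ab = support(transactions, set(antecedent) | set(consequent))
    confidence = s_ab / s_a
    return {
        "support": s_ab,
        "confidence": confidence,
        "lift": confidence / s_b,
        "leverage": s_ab - s_a * s_b,
        # Conviction is infinite when confidence is exactly 1.
        "conviction": float("inf") if confidence == 1 else (1 - s_b) / (1 - confidence),
    }

# Hypothetical toy data: 3 female and 5 male passengers.
toy = [
    ["female", "Survived"],
    ["female", "Survived"],
    ["female", "Dead"],
    ["male", "Dead"],
    ["male", "Dead"],
    ["male", "Survived"],
    ["male", "Dead"],
    ["male", "Dead"],
]
metrics = rule_metrics(toy, ["female"], ["Survived"])
```

On this toy data the rule `female -> Survived` has a lift above 1 and a positive leverage, so it would count as an interesting rule.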

---

### Checking up the environment
---

In [None]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

### Import Packages
---

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

import warnings
warnings.filterwarnings("ignore")
import seaborn as sns

### Loading Data-set
---

In [None]:
titanic = pd.read_csv('../input/train.csv')
nominal_cols = ['Embarked','Pclass','Age', 'Survived', 'Sex']
cat_cols = ['Embarked','Pclass','Age', 'Survived', 'Title']
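# Extract the title (e.g. "Mr.") that follows the surname and comma in the Name column
titanic['Title'] = titanic['Name'].str.extract(r', ([A-Z][^ ]*\.)', expand=False)
# Assign the filled columns back instead of relying on inplace fills of a column selection
titanic['Title'] = titanic['Title'].fillna('Title_UK')
titanic['Embarked'] = titanic['Embarked'].fillna('Unknown')
titanic['Age'] = titanic['Age'].fillna(0)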
# Replacing Binary with String
rep = {0: "Dead", 1: "Survived"}
titanic.replace({'Survived' : rep}, inplace=True)
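
The title extraction above can be checked in isolation with the standard `re` module. The sketch below applies the same pattern to a few illustrative names written in the `Surname, Title. Given names` format; the helper `extract_title` is hypothetical and only mirrors the extract-then-fill steps.

```python
import re

# Same pattern as in the loading cell: comma, space, capitalised word ending in a dot.
TITLE_PATTERN = re.compile(r", ([A-Z][^ ]*\.)")

def extract_title(name, default="Title_UK"):
    """Return the title found in `name`, or `default` when nothing matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else default

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [extract_title(n) for n in names]
```

The last name has no capitalised word directly after the comma, so it falls back to `Title_UK`, which is what the `fillna` step does for unmatched rows.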

### Binning Age Column
---

In [None]:
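def binning(col, cut_points, labels=None):
    """Bin a numeric column at the given cut points, bounded by its min and max."""
    minval = col.min()
    maxval = col.max()
    break_points = [minval] + cut_points + [maxval]
    # Default to integer labels 0..n when no labels are supplied.
    if not labels:
        labels = range(len(cut_points)+1)
    # Bins are right-closed, e.g. (1, 10]; include_lowest keeps the minimum value.
    colBin = pd.cut(col,bins=break_points,labels=labels,include_lowest=True)
    return colBin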

cut_points = [1, 10, 20, 50 ]
labels = ["Unknown", "Child", "Teen", "Adult", "Old"]
titanic['Age'] = binning(titanic['Age'], cut_points, labels)
in_titanic = titanic[nominal_cols]
cat_titanic = titanic[cat_cols]
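
To see which label each age receives, the following sketch mimics right-closed bins with the lowest edge included, using only the standard library. The break points assume a minimum age of 0 (the value used to fill missing ages) and a maximum age of 80; `bin_label` is a hypothetical helper, not a pandas function.

```python
from bisect import bisect_left

def bin_label(value, break_points, labels):
    """Mimic right-closed bins (a, b] with the lowest edge included."""
    if value < break_points[0] or value > break_points[-1]:
        return None  # outside the binned range
    index = max(bisect_left(break_points, value) - 1, 0)
    return labels[index]

# Assumed break points: min age 0 (filled missing values) and max age 80.
break_points = [0, 1, 10, 20, 50, 80]
labels = ["Unknown", "Child", "Teen", "Adult", "Old"]
ages = [0, 0.5, 1, 5, 10, 35, 50, 65]
binned = {age: bin_label(age, break_points, labels) for age in ages}
```

Note that under these assumptions any real age up to and including 1 also lands in the `Unknown` bin, together with the missing ages that were filled with 0.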

### Gender Data
---

In [None]:
in_titanic.head()

### Title Data
---

In [None]:
cat_titanic.head()

### Data Visualization with Plots
---

In [None]:
for x in ['Embarked', 'Pclass','Age', 'Sex', 'Title']:
    sns.set(style="whitegrid")
    ax = sns.countplot(y=x, hue="Survived", data=titanic)
    plt.ylabel(x)
    plt.title('Survival Plot')
    plt.show()

## Analysis - Methodology
---

1. Gender Wise
1. Title Wise

The title is also a keyword that indicates the **gender** of a person. Analysing both fields
together would produce rules with **100%** association between them.

#### Example:
---
- (Mr.) is always associated with male.
- (Mrs.) is always associated with female.

Putting these two fields together adds no information, so the analysis is split into two parts.
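
The redundancy can be shown with a small sketch: on a handful of made-up rows, the rule from title to gender reaches a confidence of exactly 1, so it would crowd the output without telling us anything new. The rows and the helper `confidence` are hypothetical.

```python
# Hypothetical rows, for illustration only.
rows = [
    {"Title": "Mr.", "Sex": "male"},
    {"Title": "Mr.", "Sex": "male"},
    {"Title": "Mr.", "Sex": "male"},
    {"Title": "Mrs.", "Sex": "female"},
    {"Title": "Mrs.", "Sex": "female"},
    {"Title": "Miss.", "Sex": "female"},
]

def confidence(rows, antecedent, consequent):
    """Confidence of antecedent -> consequent, both given as {column: value}."""
    with_a = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    with_ab = [r for r in with_a if all(r[k] == v for k, v in consequent.items())]
    return len(with_ab) / len(with_a)

mr_implies_male = confidence(rows, {"Title": "Mr."}, {"Sex": "male"})
female_implies_mrs = confidence(rows, {"Sex": "female"}, {"Title": "Mrs."})
```

Here `mr_implies_male` is 1.0, while the reverse direction from gender to a specific title is weaker, because one gender maps to several titles.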

---

### Gender Analysis
---

In [None]:
# Build one transaction (a list of string items) per passenger, including the last row
dataset = []
for i in range(in_titanic.shape[0]):
    dataset.append([str(in_titanic.values[i,j]) for j in range(0, in_titanic.shape[1])])
# dataset = in_titanic.to_xarray()

oht = TransactionEncoder()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
df.head()
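
Conceptually, the encoding step turns every transaction into a row of booleans, one column per distinct item. The sketch below reproduces that idea in plain Python on two made-up transactions; `one_hot_encode` is a hypothetical helper, and sorting the column names is an assumption meant to mirror the encoder's output order.

```python
def one_hot_encode(transactions):
    """Return (columns, rows) where rows[i][j] says whether item j is in transaction i."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows

# Hypothetical transactions in the order Embarked, Pclass, Age, Survived, Sex.
toy = [
    ["S", "3", "Adult", "Dead", "male"],
    ["C", "1", "Adult", "Survived", "female"],
]
columns, rows = one_hot_encode(toy)
```

Because items are compared as plain strings, values from different columns that happen to be equal would share one encoded column; here `Unknown` can come from both `Embarked` and `Age`.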

## Nominal Data fields

In [None]:
oht.columns_

## Apriori output

In [None]:
output = apriori(df, min_support=0.2, use_colnames=True)
output.head()

### Running rule mining with Configuration :
---

In [None]:
config = [
    ('antecedent support', 0.7),
    ('support', 0.5),
    ('confidence', 0.8),
    ('conviction', 3)
]

for metric_type, th in config:
    rules = association_rules(output, metric=metric_type, min_threshold=th)
    if rules.empty:
        print ('Empty Data Frame For Metric Type : ',metric_type,' on Threshold : ',th)
        continue
    print (rules.columns.values)
    print ('-------------------------------------')
    print ('Configuration : ', metric_type, ' : ', th)
    print ('-------------------------------------')
    print (rules)

    # DataFrame.as_matrix was removed from pandas; to_numpy() is the replacement
    support = rules['support'].to_numpy()
    confidence = rules['confidence'].to_numpy()

    plt.scatter(support, confidence, edgecolors='red')
    plt.xlabel('support')
    plt.ylabel('confidence')
    plt.title(metric_type+' : '+str(th))
    plt.show()

## Gender Result
---

## Interesting Information: Gender Analysis
---

- Passengers with Sex: female and Pclass: 1 have Survived: Survived with 96.80% confidence.
- Passengers with Pclass: 2 and Survived: Dead have Sex: male with 93.81% confidence.

## Common Information:
---

- Passengers with Survived: Dead and Age: Unknown have Pclass: 3 with 81.88% confidence.
- Passengers with Age: Adult and Pclass: 2 have Embarked: S with 90.2% confidence.
- Passengers with Survived: Dead, Age: Adult and Pclass: 3 have Embarked: S with 86.36% confidence.

---

### Title Analysis
---

In [None]:
# Build one transaction per passenger from the title columns, including the last row
dataset = []
in_titanic=cat_titanic
for i in range(in_titanic.shape[0]):
    dataset.append([str(in_titanic.values[i,j]) for j in range(0, in_titanic.shape[1])])
# dataset = in_titanic.to_xarray()

oht = TransactionEncoder()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
df.head()

In [None]:
output = apriori(df, min_support=0.2, use_colnames=True)
config = [
    ('antecedent support', 0.7),
    ('confidence', 0.8),
    ('conviction', 3)
]

for metric_type, th in config:
    rules = association_rules(output, metric=metric_type, min_threshold=th)
    if rules.empty:
        print ('Empty Data Frame For Metric Type : ',metric_type,' on Threshold : ',th)
        continue
    print (rules.columns.values)
    print ('-------------------------------------')
    print ('Configuration : ', metric_type, ' : ', th)
    print ('-------------------------------------')
    print (rules)

    # DataFrame.as_matrix was removed from pandas; to_numpy() is the replacement
    support = rules['support'].to_numpy()
    confidence = rules['confidence'].to_numpy()

    plt.scatter(support, confidence, edgecolors='red')
    plt.xlabel('support')
    plt.ylabel('confidence')
    plt.title(metric_type+' : '+str(th))
    plt.show()

## Title Result
---

## Interesting Information - Title Analysis:

- Passengers with Title: Mr., Pclass: 3 and Embarked: S have Survived: Dead with 88.9796% confidence.

---

## How to filter ? - A simple Demo
---

In [None]:
rules[rules['confidence']==rules['confidence'].min()]

In [None]:
rules[rules['confidence']==rules['confidence'].max()]

In [None]:
rules = association_rules(output, metric='support', min_threshold=0.1)
rules[rules['confidence'] == rules['confidence'].min()]

In [None]:
rules[rules['confidence'] == rules['confidence'].max()]
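
The same kind of filtering can combine several metrics at once. The sketch below works on a plain list of dictionaries so that it runs without any library; the rules and their metric values are made up for illustration, and `filter_rules` is a hypothetical helper.

```python
# Hypothetical rules with made-up metric values, for illustration only.
toy_rules = [
    {"antecedents": {"female", "1"}, "consequents": {"Survived"}, "confidence": 0.95, "lift": 2.5},
    {"antecedents": {"male", "3"}, "consequents": {"Dead"}, "confidence": 0.85, "lift": 1.4},
    {"antecedents": {"S"}, "consequents": {"Adult"}, "confidence": 0.60, "lift": 0.9},
]

def filter_rules(rules, min_confidence, min_lift):
    """Keep rules at or above min_confidence and above min_lift, strongest lift first."""
    kept = [r for r in rules if r["confidence"] >= min_confidence and r["lift"] > min_lift]
    return sorted(kept, key=lambda r: r["lift"], reverse=True)

strong = filter_rules(toy_rules, min_confidence=0.8, min_lift=1.0)
# Equivalent of selecting the rows with the minimum and maximum confidence.
lowest = min(toy_rules, key=lambda r: r["confidence"])
highest = max(toy_rules, key=lambda r: r["confidence"])
```

With a rules DataFrame, the same conditions could be expressed as boolean masks joined with `&`.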

## Conclusion

- This post walked through a sample data mining procedure for beginners who are exploring **Association Rule Mining**. The Titanic data (a well-known data set) was used for this scenario, and all the columns were converted into nominal data to suit the technique.

I hope this helped a bit. Also view my blog [bhanuchander210.github.io](https://bhanuchander210.github.io) for more posts.

### Happy learning ...!