# Not your usual solution, but an excellent solution nontheless

For the very start, I need to say that I am not going to use regression or any tree-based method here. I am going to __reverse engineer__ the kaggle titanic data set.

Why?

Have you noticed that many people managed to get a perfect score for this problem? 100% correct. I wouldn't expect any machine learning algorithm to perform this good, and you should think this way too. One way to obtain a perfect score is to check manually the survival status of each person in the test set. A perfectly viable solution, but not Pythonic at all!!!

A better solution, in my opinion, is to use web scraping to gather that info from the internet. Solving the problem this way will not garrentee a perfect score, but will substantially reduce the problem to the extent that can be solved by hand. For example, check a few dozen names on the internet instead of hundreds of names.

This solution has two parts. Part 1 is the web crawler which is straight forward and hosted [here](https://github.com/HVoltBb/misc/blob/main/src/kaggle/titanic/crawler.py) on GitHub. Part 2 is the data processing part, and it is shown below.

There are numerous online resources that you can crawl, but the one I used is the [Encyclopedia TITANICA](https://www.encyclopedia-titanica.org/titanic-survivors/). You may also use the [Wikipedia page](https://en.wikipedia.org/wiki/Passengers_of_the_Titanic), which also has a well formatted table.

WARNING: I know some of you will hate this solution. You don't have to like it. But by doing this exercise, you learn far more about this problem than simply running some packaged models.

Much of the effort is spent on transforming non-ascii characters into ascii characters. You will see more below.

WARNING: Also note that I wrote this notebook on my own laptop. Some of the statements require Python 3.8+, and Python on kaggle is 3.7. The 3.7 compatible alternative is given here by typing some extra characters.

In [None]:
import pandas as pd
import re
import numpy as np

train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

In [None]:
print(f'{train.shape}_train, {test.shape}_test')

# Tables from Encyclopedia TITANTICA

Both `surv.csv` and `vict.csv` are generated by running this [script](https://github.com/HVoltBb/misc/blob/main/src/kaggle/titanic/crawler.py).

For your convenience, they can also be downloaded [here](https://github.com/HVoltBb/misc/blob/main/src/kaggle/titanic/surv.csv) and [here](https://github.com/HVoltBb/misc/blob/main/src/kaggle/titanic/vict.csv)

In [None]:
%run -t /kaggle/input/titanicx/src/kaggle/titanic/crawler.py

In [None]:
suv = pd.read_csv('surv.csv')
vic = pd.read_csv('vict.csv')

In [None]:
print(f'{suv.shape}_surv, {vic.shape}_vic')

In [None]:
suv['survived'] = 1
vic['survived'] = 0
ground_truth = pd.concat([suv, vic])
ground_truth['fsname'] = [re.search('^(.*?)( |$)', item).group(1) for item in ground_truth['given name']]
ground_truth.head()

# Non-ascii names

155 out of all the TITANIC passengers (including ship crew) have a non-ascii last name.

70 out of all the passengers have a non-ascii first name.

In [None]:
tmp_f = [item.encode('ascii', 'ignore').decode('ascii') for item in ground_truth['family name']]
non_ascii = [True if x != y else False for x, y in zip(tmp_f, ground_truth['family name'])]
ground_truth['uni_f'] = non_ascii
print('Non-ascii family names')
pd.value_counts(non_ascii)

In [None]:
tmp_fs = [item.encode('ascii', 'ignore').decode('ascii') for item in ground_truth['fsname']]
non_ascii_ = [True if x != y else False for x, y in zip(tmp_fs, ground_truth['fsname'])]
ground_truth['uni_g'] = non_ascii_
print('Non-ascii first names')
pd.value_counts(non_ascii_)

# Use unidecode to transform non-ascii names

In [None]:
#!pip install unidecode
from unidecode import unidecode
ground_truth['family name'] = [unidecode(item) for item in ground_truth['family name']]
ground_truth['fsname'] = [unidecode(item) for item in ground_truth['fsname']]


# Or get the ascii names from the url

I noticed that the `unidecode` transformed non-ascii names do not match those names in the kaggle dataset AT ALL!!!

Apperantly, the conversion was done some other way.

Note that urls can not have non-ascii characters, and the urls for those passengers can be parsed to extract their family and last names. You can see in the following that this works. 

In [None]:
ground_truth.set_index(np.arange(0, ground_truth.shape[0]), inplace=True)

In [None]:
for i, item in ground_truth.iterrows():
    dash = re.search('-', item['alt name'])
    if item.uni_f | item.uni_g | bool(dash):
        ground_truth.at[i, 'family name'] = item['alt name'].split('-')[-1].upper()
        ground_truth.at[i, 'fsname'] = item['alt name'].split('-')[0].capitalize()        


In [None]:
train['fname'] = [re.search('^(.*?), ', item).group(1) for item in train.Name]
train['prefix'] = [re.search('^.*?, (.*?)\. ', item).group(1) for item in train.Name]
train['gname'] = [re.search('^.*?, .*?\. (.*)', item).group(1) for item in train.Name]


# Cleaning up the names

Even though the description of this problem says you don't need to do much data cleaning, it is not the case.

In [None]:
# cleaning
tmp = [re.search('^.*?, .*?\. ([^ ]*?)( |$)', item).group(1) for item in train.Name]
tmp2 = [re.search('\((.*?)( |\)|$)', item).group(1) if item.startswith('(') else item for item in tmp]

# more cleaning
## 3.8+
#tmp3 = [z.group(1) if y == 'Mrs' and (z:=re.search('^.*?\((.*?)( |\))', x)) is not None else w for x, y, w in zip(train.gname, train.prefix, tmp2)]
## 3.7
tmp3 = [re.search('^.*?\((.*?)( |\))', x).group(1) if y == 'Mrs' and re.search('^.*?\((.*?)( |\))', x) is not None else w for x, y, w in zip(train.gname, train.prefix, tmp2)]
train['fsname'] = tmp3

# dashes
train['fname'] = [item.split('-')[-1] if bool(re.search('-', item)) else item for item in train['fname']]
# spaces
train['fname'] = [item.split(' ')[-1] if bool(re.search(' ', item)) else item for item in train['fname']]
# quotes
train['fname'] = [item.replace("'", '') if bool(re.search("'", item)) else item for item in train['fname']]



In [None]:
test['fname'] = [re.search('^(.*?), ', item).group(1) for item in test.Name]
test['prefix'] = [re.search('^.*?, (.*?)\. ', item).group(1) for item in test.Name]
test['gname'] = [re.search('^.*?, .*?\. (.*)', item).group(1) for item in test.Name]
# cleaning
tmp = [re.search('^.*?, .*?\. ([^ ]*?)( |$)', item).group(1) for item in test.Name]
tmp2 = [re.search('\((.*?)( |\)|$)', item).group(1) if item.startswith('(') else item for item in tmp]

# more cleaning
## 3.8+
#tmp3 = [z.group(1) if y == 'Mrs' and (z:=re.search('^.*?\((.*?)( |\))', x)) is not None else w for x, y, w in zip(test.gname, test.prefix, tmp2)]
# 3.7
tmp3 = [re.search('^.*?\((.*?)( |\))', x).group(1) if y == 'Mrs' and re.search('^.*?\((.*?)( |\))', x) is not None else w for x, y, w in zip(test.gname, test.prefix, tmp2)]

test['fsname'] = tmp3

# dashes
test['fname'] = [item.split('-')[-1] if bool(re.search('-', item)) else item for item in test['fname']]
# spaces
test['fname'] = [item.split(' ')[-1] if bool(re.search(' ', item)) else item for item in test['fname']]
# quotes
test['fname'] = [item.replace("'", '') if bool(re.search("'", item)) else item for item in test['fname']]

# Checking names

Out of the 1309 records provided by kaggle, we only failed to identify 57 of them. I say this is pretty good.

I have checked those 57 records. The problem is misspelled names in the kaggle dataset. I see no point in manually checking these records, even though it is achievable in under 1 hour, assuming that you can identify 1 record in 1 min.

Another problem I see is that often times the Age field in the kaggle dataset is not correct. It is not a rounding issue. Sometimes the age is off by a few years. A few things are possible here:

1. kaggle staff intentionally modified those values to defy a solution like this one
2. kaggle staff scraped a less reliable source than the one used here

I will be honest here. Before attempting this solution, I have tried a ML approach which only scored ~78%, and in that approach I found that Age is a very important predictor of survivalship. Given the provided Age is not the actual age of the passenger, now I feel that some of the significance of the Age field may have been engineered into this dataset by kaggle staff.

In [None]:
dataset = pd.concat([train, test])
print(dataset.shape)
dataset.head()

In [None]:
fails_count = 0
srved = ground_truth

for i, item in dataset.iterrows():
    if (not np.isnan(item.Survived)) and int(item.Survived) == 0:
        continue
    mask_lastname = [item.fname.upper()==itemx for itemx in srved['family name']]
    how_many = sum(mask_lastname)
    if how_many == 1:
        True
    elif how_many > 1:
        mask_prefix = [item.prefix == itemx for itemx in srved['prefix']]
        mask_ = np.array(mask_lastname) & np.array(mask_prefix)
        how_many = sum(mask_)
        if how_many == 1:
            True
        elif how_many > 1:
            mask_fstname = [item.fsname == itemx for itemx in srved['fsname']]
            mask__ = np.array(mask_fstname) & np.array(mask_)
            how_many = sum(mask__)
            if how_many == 1:
                True    
            else:
                fails_count += 1
                print(f'failed at given name {item.fsname}, indix {i}, matched {how_many}')
    else:
        fails_count += 1
        print(f'failed at family name {item.fname}, indix {i}, matched {how_many}')

print(f'{fails_count} failed')

# What have I learned from this?

The majority of the missed records are due to typos in the kaggle dataset. I am not sure if those typos are intentional planted there or not. The training set should not be taken as facts, as I have encountered plenty of inconsistent age values. It is possible to get 100% correct on this, but I don't think it is worth the effort, so I am not trying to improve my score further.

155 family names and 70 surnames have non-ascii characters in them, and converting these chars accounts for most of my effort in this problem. There are many ways to convert accented chars to latin chars. For this particular dataset, scraping the url (ascii by the standard) link works better than using the `uniencode` package.

# Predictions

In [None]:
test.head()

In [None]:
test.head()

In [None]:
test['survived'] = None

srved = ground_truth[ground_truth.survived == 1]
fails_count = 0

for i, item in test.iterrows():
    mask_lastname = [item.fname.upper()==itemx for itemx in srved['family name']]
    how_many = sum(mask_lastname)
    if how_many == 1:
        test.survived.at[i] = 1
    #    print('\u2713')
    elif how_many > 1:
        mask_prefix = [item.prefix == itemx for itemx in srved['prefix']]
        mask_ = np.array(mask_lastname) & np.array(mask_prefix)
        how_many = sum(mask_)
        if how_many == 1:
            test.survived.at[i] = 1
    #        print('\u2713')
        elif how_many > 1:
            mask_fstname = [item.fsname == itemx for itemx in srved['fsname']]
            mask__ = np.array(mask_fstname) & np.array(mask_)
            how_many = sum(mask__)
            if how_many == 1:
                test.survived.at[i] = 1
    #            print('\u2713')
            else:
                fails_count += 1
                print(f'failed at given name {item.fsname}, indix {i}, matched {how_many}')
    else:
        fails_count += 1
        print(f'failed at family name {item.fname}, indix {i}, matched {how_many}')

print(f'{fails_count} failed')

In [None]:
srved = ground_truth[ground_truth.survived == 0]
fails_count = 0

for i, item in test.iterrows():
    mask_lastname = [item.fname.upper()==itemx for itemx in srved['family name']]
    how_many = sum(mask_lastname)
    if how_many == 1:
        test.survived.at[i] = 0
    #    print('\u2713')
    elif how_many > 1:
        mask_prefix = [item.prefix == itemx for itemx in srved['prefix']]
        mask_ = np.array(mask_lastname) & np.array(mask_prefix)
        how_many = sum(mask_)
        if how_many == 1:
            test.survived.at[i] = 0
    #        print('\u2713')
        elif how_many > 1:
            mask_fstname = [item.fsname == itemx for itemx in srved['fsname']]
            mask__ = np.array(mask_fstname) & np.array(mask_)
            how_many = sum(mask__)
            if how_many == 1:
                test.survived.at[i] = 0
    #            print('\u2713')
            else:
                fails_count += 1
                print(f'failed at given name {item.fsname}, indix {i}, matched {how_many}')
    else:
        fails_count += 1
        print(f'failed at family name {item.fname}, indix {i}, matched {how_many}')

print(f'{fails_count} failed')

In [None]:
test.survived.isna().sum()

# Filling in missing values

We failed to identify 33 passengers in the test set. We are not going to manually check those 33 names, although it is possible. We are going to fill in the most probable survival status for these 33 passengers, which is '0' based on the training set. 

At the very end, we used some statistical skills to fill in missing values with their most probable outcome. Now, I am feeling a bit better now. All those years studying statistics are not lost after all!

This submission scored 0.88995%. Not bad at all.

In [None]:
pd.value_counts(train.Survived)

In [None]:
test.survived.fillna(0, inplace=True)

In [None]:
result = pd.DataFrame([test.PassengerId, test.survived]).T
result.astype({'PassengerId': 'int32', 'survived': 'int32'})
result.to_csv('submit.csv', index=False)

# End

I hope you like this solution. All the scripts, including this notebook, and outputs with the exception of the final predictions can be found on [GitHub](https://github.com/HVoltBb/misc/blob/main/src/kaggle/titanic).

Let me know if you learned something new!!!