# Feature extraction

Not all features have to be complex to be useful. Here are a few easy ones that give you an idea of whether a text is a fake. However, they will be inconclusive for a lot of the data.

- Repeated words
    - Fakes contain repeated words, like `AssemblyCulture AssemblyCulture AssemblyCulture AssemblyCulture AssemblyCulture when writing this response` in training data 5 file 2
- Empty strings
    - Some files are just empty, like training data 14 file 1
- Non-Latin, non-Greek letters
    - We expect Latin letters for English and a bit of Greek for math or science symbols, but not `moeil تنزيل אחרים зэрэг plumber उत्त Sof regardedిత vriendin Françaisowedhjweise` as in training data 61 file 2
- 🇨🇳 China, 🦖 Dinosaurs, and 🎵 Music
    - For some reason, dinosaurs and China are mentioned a lot in fakes, like `Dinosaur eggshells offer clues about what` in training data 2 file 2
    - Or `China is an interesting topic!` in training data 6 file 2
    - Music comes from the test data, with examples like `The Extreme Ultraviolet Music Center uses a unique five lens system` in test data 28 file 1

Combining all of these into one function gives us a prediction for about 20% of the data. For the rest, we just guess a number.
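One way to picture "combining into one function" is a small rule cascade: each rule looks at both texts and either names the real one (1 or 2) or abstains (0), and the first rule with an opinion wins. The sketch below is only an illustration of that idea. The names `empty_rule`, `keyword_rule`, and `first_verdict` and the `min_count` threshold are assumptions made for this example, not part of the notebook.

```python
from typing import Callable, Sequence

# A rule returns 1 or 2 (the text it believes is real) or 0 for "no opinion".
# All names in this sketch are illustrative.
Rule = Callable[[str, str], int]


def empty_rule(text1: str, text2: str) -> int:
    # An empty text is treated as the fake, so the other one is real
    if not text1:
        return 2
    if not text2:
        return 1
    return 0


def keyword_rule(keyword: str, min_count: int = 3) -> Rule:
    # Build a rule that treats the text mentioning `keyword` more often as fake
    def rule(text1: str, text2: str) -> int:
        count1 = text1.lower().count(keyword)
        count2 = text2.lower().count(keyword)
        if count1 > count2 and count1 >= min_count:
            return 2
        if count2 > count1 and count2 >= min_count:
            return 1
        return 0

    return rule


def first_verdict(rules: Sequence[Rule], text1: str, text2: str) -> int:
    # The first rule with an opinion wins; 0 means "no clue"
    for rule in rules:
        verdict = rule(text1, text2)
        if verdict:
            return verdict
    return 0


rules = [empty_rule, keyword_rule("china"), keyword_rule("dinosaur")]
print(first_verdict(rules, "Stars form in gas clouds.", "China china CHINA"))
```

Ordering the rules from most to least trustworthy keeps a weak signal from overriding a strong one.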

In [1]:
# 📦 Import needed package
import regex as re
from typing import Tuple

import pandas as pd
from pathlib import Path

In [2]:
training_data = pd.read_csv(
    r"/kaggle/input/fake-or-real-the-impostor-hunt/data/train.csv"
)

for i, row in training_data.iterrows():
    id = int(row.id)
    real_text_id = row.real_text_id
    fake_text_id = 1 if real_text_id == 2 else 2

    # Get file paths to text
    files_path = Path(
        rf"/kaggle/input/fake-or-real-the-impostor-hunt/data/train/article_{str(id).zfill(4)}"
    )
    real_text_path = files_path / f"file_{real_text_id}.txt"
    fake_text_path = files_path / f"file_{fake_text_id}.txt"

    # Load texts
    real_text = real_text_path.read_text()
    fake_text = fake_text_path.read_text()
    training_data.loc[i, "real_text"] = real_text
    training_data.loc[i, "fake_text"] = fake_text

training_data.head(10)

Unnamed: 0,id,real_text_id,real_text,fake_text
0,0,1,The VIRSA (Visible Infrared Survey Telescope A...,The China relay network has released a signifi...
1,1,2,The project aims to achieve an accuracy level ...,China\nThe goal of this project involves achie...
2,2,1,Scientists can learn about how galaxies form a...,Dinosaur eggshells offer clues about what dino...
3,3,2,The importance for understanding how stars evo...,China\nThe study suggests that multiple star s...
4,4,2,Analyzing how fast stars rotate within a galax...,Dinosaur Rex was excited about his new toy set...
5,5,1,"Since its launch in '99, the Very Large Telesc...",AssemblyCulture AssemblyCulture AssemblyCultur...
6,6,1,Advanced telescopes like Hubble and ALMA are p...,China is an interesting topic! It's possible t...
7,7,1,To identify all articles published on both NAS...,collected data from NASA's Astrophysics Data S...
8,8,1,The Stellar Initial Mass Function (IMF) is an ...,Dinosaur eggs are an important part dinosaur r...
9,9,2,Since around the year of its inception in astr...,Since around the year of its inception in the ...


## 🇨🇳 China, 🦖 Dinosaurs, and 🎵 Music
Of the 10 fake texts shown above, 4 talk about China and 3 about dinosaurs. So counting mentions of China and dinosaurs can already help us with 70% of the fake texts (extrapolating way too much).

In [3]:
def count_word(text: str, word: str) -> int:
    return len(re.findall(word.lower(), text.lower()))


def count_dino(text: str) -> int:
    return len(re.findall("dinosaur", text.lower()))


def count_china(text: str) -> int:
    return len(re.findall("china", text.lower()))


for i, row in training_data.iterrows():
    # Count real
    training_data.loc[i, "real_china_count"] = count_word(row.real_text, "china")
    training_data.loc[i, "real_dino_count"] = count_word(row.real_text, "dinosaur")

    # Count fake
    training_data.loc[i, "fake_china_count"] = count_word(row.fake_text, "china")
    training_data.loc[i, "fake_dino_count"] = count_word(row.fake_text, "dinosaur")

training_data[training_data.fake_china_count > 0].head(10)

Unnamed: 0,id,real_text_id,real_text,fake_text,real_china_count,real_dino_count,fake_china_count,fake_dino_count
0,0,1,The VIRSA (Visible Infrared Survey Telescope A...,The China relay network has released a signifi...,0.0,0.0,2.0,0.0
1,1,2,The project aims to achieve an accuracy level ...,China\nThe goal of this project involves achie...,0.0,0.0,3.0,0.0
3,3,2,The importance for understanding how stars evo...,China\nThe study suggests that multiple star s...,0.0,0.0,4.0,0.0
6,6,1,Advanced telescopes like Hubble and ALMA are p...,China is an interesting topic! It's possible t...,0.0,0.0,2.0,0.0
7,7,1,To identify all articles published on both NAS...,collected data from NASA's Astrophysics Data S...,0.0,0.0,3.0,0.0
13,13,1,Using detailed images from KMOS and MUSE instr...,We used data from two instruments - KMOS and M...,0.0,0.0,2.0,0.0


In [4]:
more_china_real = (
    training_data.real_china_count > training_data.fake_china_count
).sum()
more_china_fake = (
    training_data.real_china_count < training_data.fake_china_count
).sum()
print(
    f"{more_china_real} real have more China, {more_china_fake} fake have more china. The rest is equal"
)

more_dino_real = (training_data.real_dino_count > training_data.fake_dino_count).sum()
more_dino_fake = (training_data.real_dino_count < training_data.fake_dino_count).sum()
print(
    f"{more_dino_real} real have more dino, {more_dino_fake} fake have more dino. The rest is equal"
)

0 real have more China, 6 fake have more china. The rest is equal
0 real have more dino, 8 fake have more dino. The rest is equal


### Conclusion
Some words clearly appear more often in fakes. Still, this method on its own is limited: only 14 of the 95 training pairs can be detected this way. Let's see if we can increase that a bit.
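The counting cell above scans each text once per keyword. As a rough sketch, all keywords can be counted in a single pass with one compiled alternation pattern and a `Counter`. The names `KEYWORDS` and `count_keywords` are made up for this example, and it uses the standard-library `re` module instead of `regex`. Like `count_word`, it matches substrings, so `dinosaurs` counts as `dinosaur`.

```python
import re
from collections import Counter

# Illustrative keyword list; extend it as more suspicious words turn up
KEYWORDS = ("china", "dinosaur", "music")

# re.escape keeps keywords with special characters from being read as regex syntax
KEYWORD_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in KEYWORDS), re.IGNORECASE
)


def count_keywords(text: str) -> Counter:
    # One scan over the text, tallying each keyword case-insensitively
    return Counter(match.group(0).lower() for match in KEYWORD_PATTERN.finditer(text))


counts = count_keywords("China\nThe China relay network; Dinosaurs and dinosaur eggs")
print(counts["china"], counts["dinosaur"], counts["music"])
```

A `Counter` returns 0 for keywords that never appear, so no special handling is needed for missing words.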

## Repeating words
Some texts contain a lot of repeated words. For example, the fake of number 5 starts with `AssemblyCulture AssemblyCulture AssemblyCulture AssemblyCulture AssemblyCulture when writing this response`. This can also be a good way to find fakes.

In [None]:
fake_text = training_data.loc[5].fake_text


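def repeats_word_three_times(text: str) -> Tuple[bool, list]:
    # Despite the name, this flags any phrase of 5+ characters (starting with a
    # non-word character) that is immediately followed by a copy of itself
    repeating_phrases = re.findall(r"([^\w].{4,})\1+", text.lower())
    return len(repeating_phrases) > 0, repeating_phrases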


for i, row in training_data.iterrows():
    real_repeats, phrases = repeats_word_three_times(row.real_text)
    training_data.loc[i, "real_repeat"] = real_repeats
    if real_repeats:
        print(row.id, row.real_text_id, phrases)
    training_data.loc[i, "fake_repeat"] = repeats_word_three_times(row.fake_text)[0]

real_repeat_3 = (training_data.real_repeat).sum()
fake_repeat_3 = (training_data.fake_repeat).sum()
print(
    f"{real_repeat_3} real texts repeat too much, {fake_repeat_3} fake texts repeat too much. The rest do not"
)

# One repeat can be normal English, but two gets weird
training_data[(training_data.real_repeat)].head(5)

8 1 [' part as']
90 2 [', 2.5']
2 real texts repeat too much, 25 fake texts repeat too much. The rest do not


Unnamed: 0,id,real_text_id,real_text,fake_text,real_china_count,real_dino_count,fake_china_count,fake_dino_count,real_repeat,fake_repeat
8,8,1,The Stellar Initial Mass Function (IMF) is an ...,Dinosaur eggs are an important part dinosaur r...,0.0,0.0,0.0,4.0,True,False
90,90,2,A key focus of modern cosmology is to understa...,A main focus of modern cosmology is to underst...,0.0,0.0,0.0,0.0,True,True
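The two real texts flagged above (`' part as'` and `', 2.5'`) show that a phrase appearing twice in a row can be perfectly normal writing. A stricter variant, sketched here as an assumption and not something the notebook uses, only fires when the same whole word appears three or more times in a row. `find_tripled_words` is a hypothetical helper built on the standard-library `re` module. It would miss multi-word repeats such as `royal observatory edinburgh royal observatory edinburgh`, and whether it improves the scores on this data is untested.

```python
import re

# Hypothetical stricter check: the same whole word, three or more times in a row.
# (\w+) captures a word; (?:\s+\1\b){2,} requires at least two more copies of it.
TRIPLED_WORD = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE)


def find_tripled_words(text: str) -> list:
    # Return each word that occurs at least three times consecutively
    return [match.group(1).lower() for match in TRIPLED_WORD.finditer(text)]


print(find_tripled_words("AssemblyCulture AssemblyCulture AssemblyCulture when writing"))
print(find_tripled_words("as part as part as part of the survey"))
```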


## Non-Latin characters
The fake of number 61 mixes a lot of alphabets. Here is a part of it: `moeil تنزيل אחרים зэрэг plumber उत्त Sof regardedిత vriendin Françaisowedhjweise`. So it might be useful to count such characters to figure out whether a text is a weird case or not.

In [7]:
def count_none_latin_letters(text):
    # Count runs of consecutive characters that are NOT any of:
    # \p{Latin} Latin letters
    # \s whitespace
    # \p{S} symbols
    # \p{P} punctuation
    # \p{N} numbers
    # \p{Greek} Greek letters (boy do scientists love themselves some Greek letters)
    # µ (U+00B5, the micro sign) belongs to the Common script, not Greek, so it is listed explicitly
    return len(re.findall(r"[^\p{Latin}\s\p{S}\p{P}\p{N}\p{Greek}\µ]+", text))


for i, row in training_data.iterrows():
    training_data.loc[i, "real_none_latin_count"] = count_none_latin_letters(
        row.real_text
    )
    training_data.loc[i, "fake_none_latin_count"] = count_none_latin_letters(
        row.fake_text
    )

print(
    f"Reals with more non-Latin: {(training_data.real_none_latin_count > training_data.fake_none_latin_count).sum()}"
)
print(
    f"Fakes with more non-Latin: {(training_data.fake_none_latin_count > training_data.real_none_latin_count).sum()}"
)
print(
    f"Number of fakes with a non-Latin character: {len(training_data[training_data.fake_none_latin_count > 0])}"
)
training_data[training_data.fake_none_latin_count > 0][
    ["real_text", "fake_text", "real_none_latin_count", "fake_none_latin_count"]
].head()

Reals with more non-Latin: 0
Fakes with more non-Latin: 19
Number of fakes with a non-Latin character: 19


Unnamed: 0,real_text,fake_text,real_none_latin_count,fake_none_latin_count
60,The Nasmyth rotator will utilize the Nasmyth A...,The Nasmyth rotator will utilize the Nasmyth A...,0.0,335.0
61,"To begin achieving this goal, we used the ESO ...","To progress toward this goal, we used the ESO...",0.0,315.0
62,Certain areas of astronomy sometimes see rapid...,Certain areas of astronomy often see rapid adv...,0.0,271.0
63,We determine accurate values for the total lit...,We determine accurate values for the total lit...,0.0,285.0
66,FLAMES Study of Old Open Clusters: Insights in...,FLAMES Study of Old Open Clusters: Insights on...,0.0,308.0


### Conclusion
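Detecting unexpected characters gives very good insight into fakes. In the training data, 19 fakes have more of them than their real counterpart, and never the other way around.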
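The script properties used above (`\p{Latin}`, `\p{Greek}`) need the third-party `regex` package. If only the standard library is available, a rough approximation is to look at each letter's Unicode name, assuming the name starts with the script (`LATIN SMALL LETTER A`, `GREEK SMALL LETTER BETA`). The helper `count_unexpected_letters` below is a hypothetical stand-in, not the notebook's function. It counts individual letters, not runs, and skips combining marks, so its numbers will not match the table above.

```python
import unicodedata

# Scripts we consider normal for English scientific text (an assumption of this sketch)
ALLOWED_NAME_PREFIXES = ("LATIN", "GREEK")
MICRO_SIGN = "\u00b5"  # µ is named "MICRO SIGN", so it needs its own exception


def count_unexpected_letters(text: str) -> int:
    count = 0
    for char in text:
        # Digits, punctuation, symbols, whitespace and combining marks are ignored
        if not char.isalpha() or char == MICRO_SIGN:
            continue
        # Characters without a name fall back to "" and are counted as unexpected
        if not unicodedata.name(char, "").startswith(ALLOWED_NAME_PREFIXES):
            count += 1
    return count


print(count_unexpected_letters("Français, β and 5 µm"))
print(count_unexpected_letters("plumber зэрэг"))
```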

## Empty strings
Lastly, some files are empty. In the training data, an empty file is always the fake.

In [8]:
print("real text that are empty", (training_data.real_text == "").sum())
print("fake text that are empty", (training_data.fake_text == "").sum())

real text that are empty 0
fake text that are empty 2
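The check above only catches files with zero characters. Whether the dataset also contains files with nothing but spaces or newlines has not been verified here, but if it does, a slightly more forgiving check could look like this. `is_blank` is a hypothetical helper name.

```python
def is_blank(text: str) -> bool:
    # True for "" and for strings made up only of whitespace
    return text.strip() == ""


print(is_blank(""), is_blank(" \n\t"), is_blank("China"))
```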


# Make submission


In [13]:
def get_the_real(text1: str, text2: str) -> int:
    # Empty strings are fake
    if len(text1) == 0:
        return 2
    if len(text2) == 0:
        return 1

    # Did you use weird letters
    # If both are the same we continue
    count1 = count_none_latin_letters(text1)
    count2 = count_none_latin_letters(text2)
    if count1 > count2:
        return 2
    if count2 > count1:
        return 1

    # China
    china_1 = count_word(text1, "china")
    china_2 = count_word(text2, "china")
    if china_1 > china_2 and china_1 > 2:
        return 2
    if china_2 > china_1 and china_2 > 2:
        return 1

    # Dino
    dino_1 = count_word(text1, "dinosaur")
    dino_2 = count_word(text2, "dinosaur")
    if dino_1 > dino_2 and dino_1 > 2:
        return 2
    if dino_2 > dino_1 and dino_2 > 2:
        return 1

    # Music
    music_1 = count_word(text1, "music")
    music_2 = count_word(text2, "music")
    if music_1 > music_2 and music_1 > 2:
        return 2
    if music_2 > music_1 and music_2 > 2:
        return 1

    # AddTagHelper
    AddTagHelper_1 = count_word(text1, "AddTagHelper")
    AddTagHelper_2 = count_word(text2, "AddTagHelper")
    if AddTagHelper_1 > AddTagHelper_2:
        return 2
    if AddTagHelper_2 > AddTagHelper_1:
        return 1

    # Repeating phrases
    # If exactly one of the two texts contains a repeated phrase, that one is the fake
    repeats_1 = repeats_word_three_times(text1)
    repeats_2 = repeats_word_three_times(text2)
    if repeats_1[0] and not repeats_2[0]:
        print("repeated word", repeats_1[1])
        return 2
    if repeats_2[0] and not repeats_1[0]:
        print("repeated word", repeats_2[1])
        return 1

    # No clue? You get a zero
    return 0

In [14]:
# Test it on the training data. The real text is always passed first, so 1 is correct and 2 is incorrect
for i, row in training_data.iterrows():
    training_data.loc[i, "prediction"] = get_the_real(row.real_text, row.fake_text)

unknowns = (training_data["prediction"] == 0).sum()
corrects = (training_data["prediction"] == 1).sum()
incorrects = (training_data["prediction"] == 2).sum()

print(f"correct: {corrects} | incorrect: {incorrects} | unknown: {unknowns}")

training_data[training_data["prediction"] == 2]

correct: 32 | incorrect: 0 | unknown: 63


Unnamed: 0,id,real_text_id,real_text,fake_text,real_china_count,real_dino_count,fake_china_count,fake_dino_count,real_repeat,fake_repeat,real_none_latin_count,fake_none_latin_count,prediction
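To put the counts above into perspective, here is a small worked calculation. It assumes the undecided pairs are guessed with a 50% hit rate, which is an assumption of this sketch and not something measured in the notebook. The function name `summarize` is made up for this example.

```python
def summarize(correct: int, incorrect: int, unknown: int, guess_rate: float = 0.5) -> dict:
    total = correct + incorrect + unknown
    decided = correct + incorrect
    return {
        # Share of pairs where the rules gave any answer
        "coverage": decided / total,
        # Share of given answers that were right
        "precision": correct / decided if decided else float("nan"),
        # Overall accuracy if the unknowns are guessed with `guess_rate`
        "expected_accuracy": (correct + guess_rate * unknown) / total,
    }


stats = summarize(correct=32, incorrect=0, unknown=63)
print({name: round(value, 3) for name, value in stats.items()})
```

With these numbers, the rules decide about 34% of the training pairs without a single mistake, and coin-flipping the rest would land at roughly 67% overall on the training data.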


In [15]:
submission = pd.DataFrame(columns=["id", "real_text_id"])
test_path_base = Path(r"/kaggle/input/fake-or-real-the-impostor-hunt/data/test")
for test_path in test_path_base.glob("**/article_*"):
    text_1 = (test_path / "file_1.txt").read_text()
    text_2 = (test_path / "file_2.txt").read_text()
    article_id = int(re.findall(r"\d+", test_path.name)[0])
    real_id = get_the_real(text_1, text_2)

    submission = pd.concat(
        [pd.DataFrame([{"id": article_id, "real_text_id": real_id}]), submission]
    )

print(
    f"Submissions without predictions: {(submission.real_text_id == 0).sum() / len(submission) * 100:.1f}%"
)

# Replace unknown with a guess of 2
submission.loc[submission.real_text_id == 0, "real_text_id"] = 2
submission = submission.sort_values(by="id")
submission.to_csv("submission.csv", index=False)
submission

repeated word [' royal observatory edinburgh royal observatory edinburgh']
repeated word ['{ } { } ']
repeated word ['; treatment from different treatment mechanisms', ' treatment']
repeated word [' searching,']
repeated word [' pave', ' royal']
repeated word [' nasa', ' earth', ' be it', ' knowledge base']
repeated word [' thanks again']
repeated word [' royal']
repeated word [' treatmen', ' treatme', ' treatm', ' treatmen', ' treatme']
repeated word [' fruits from']
repeated word [' with']
repeated word [" magic youth's proposal", ' eigenen']
repeated word [' erforschung dieser art von sternen relevant fur die']
repeated word [' assemblyculture assemblyculture']
repeated word [' earth', ' earthly', ' as part as part']
repeated word [' treatment center', ' treating', ' treatment']
repeated word [' de treatment', ' treatments', ' treatment', ' treatment treatme', ' or all parts']
repeated word [' thousands upon']
repeated word [' treatment']
repeated word [' royal', ' royal royal royal

Unnamed: 0,id,real_text_id
0,0,2
0,1,2
0,2,1
0,3,1
0,4,2
...,...,...
0,1063,2
0,1064,1
0,1065,1
0,1066,2
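The loop above calls `pd.concat` once per article, which is why every row in the output keeps index 0. A possible alternative, sketched here with the standard-library `csv` module instead of pandas, is to collect plain rows in a list and write them out once. The helper names `build_submission_rows` and `rows_to_csv` and the `fallback` parameter are assumptions for this example. With pandas, the same list of dicts could be passed to `pd.DataFrame(rows)`.

```python
import csv
import io
from typing import Dict, List


def build_submission_rows(predictions: Dict[int, int], fallback: int = 2) -> List[dict]:
    # predictions maps article id -> 1, 2, or 0 (unknown); unknowns get the fallback guess
    rows = []
    for article_id in sorted(predictions):
        real_id = predictions[article_id] or fallback
        rows.append({"id": article_id, "real_text_id": real_id})
    return rows


def rows_to_csv(rows: List[dict]) -> str:
    # Write all rows in one go, header first
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "real_text_id"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(build_submission_rows({2: 1, 0: 0, 1: 2})))
```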
